ABSTRACT


U.S. Marine Corps logisticians and operational planners must simultaneously plan for the 
sustainment of current operations while planning for future operations. Currently, this 
process is hindered by the manual correlation of force consumption data from electronic 
and hardcopy documents. 

To refine this process, this thesis presents a method for converting,
analyzing, and storing these documents in an electronic format. To aid in the
conversion process, three optical character recognition (OCR) applications are compared:
an open-source and freely-available online application, Microsoft OneNote®, and 
Nuance OmniPage®. Two data extraction programs were created and compared to assess 
the feasibility of automating the analysis phase. The first program concentrated on 
automated analysis with user review at the end. The second program concentrated on 
continual user interaction throughout the entire process. 

The results of these comparisons advocate the use of professional-grade OCR 
software such as OmniPage® to create a standard file that can be accepted as an input by 
a data extraction program. Based on the consumption documents reviewed by this thesis, 
a manual data extraction program is advised to create a universal output format for later 
use in an appropriate data storage method.

I. INTRODUCTION


A. MOTIVATION 

The United States Marine Corps, established in November 1775, has participated 
in countless operations around the world. Tasked with a multitude of operations ranging 
from major armed conflicts to humanitarian aid missions, U.S. Marine Corps logisticians 
and operational planners must continually plan, maintain, and execute acquisition and 
distribution programs to support approximately 174,000 troops [1].
Regardless of the scale of an operation or mission, they must always answer the 
fundamental logistical question of “how much equipment and supplies do we need for the 
amount of personnel assigned to the mission?” 

In order to answer this question, they must locate the correct reference documents 
that may exist in electronic and hardcopy formats, retrieve the correct consumption data 
that often resides in “usage tables,” and analyze these inputs to produce useful planning
data. This cyclic process is depicted in Figure 1. For the purpose of this
thesis and from a logistical standpoint, “consumption data” is an all-encompassing term 
that describes the raw data contained in the logistical planning factor input documents. 
When a specific example of consumption data is illustrated, it will be presented as a 
“consumption data element.” The term “usage table” is defined in the next section. 



Figure 1. Consumption Data Correlation Process

1. The Current Methodology of Manual Correlation

Under the current methodology, the location, retrieval, and analysis of
consumption data are conducted manually by civilian and military personnel in the
logistics or operations field from various locations throughout the world.

a. Location 

The first challenge that must be overcome is the collection of the correct reference 
documents. While some may exist in an electronic format such as a portable document 
format (PDF), others exist as hardcopy books, field manuals, and orders. Thus, the 
planner must have access to both electronic and hardcopy resources. Depending on their 
situation, this may not be possible. 

b. Retrieval 

The next challenge is locating from these documents the correct consumption data 
that often reside in tables. These tables, referred to as “usage tables,” represent the 
standard display format of consumption data elements. Typically, each consumption data 
element is listed in a table with its corresponding consumption rate. Thus, a usage table is 
a collection of consumption data elements and their corresponding rate of consumption or 
allowance. In order to understand the end-user’s use of this tabular data, an illustration of 
one usage table and its properties is given as an example. 

(1) Usage Table. Figure 2 illustrates one instance of a usage table. Each line,
composed of four properties (columns), represents one single consumption data element.
Note that for the example, some columns are not included for legibility.

Figure 2. Infantry-Heavy Threat Combat Planning Factors Table (from [2])

This table consists of the following columns:

• Weapon. Consists of two fields: Weapon ID and Nomenclature. These 
two fields are used to uniquely identify the weapon. An expected input for 
this field is alphanumeric characters. 

• Ammunition. Consists of two fields: the Department of Defense 
Identification (DODIC) and Nomenclature. These two fields are used to 
uniquely identify the ammunition being used by the weapon identified in 
the “Weapon” column. An expected input for this field is alphanumeric 
characters. 

• GCE Rates. This column is used to define the consumption rate of the 
Ground Combat Element (GCE) component of the Marine Air-Ground 
Task Force (MAGTF). The GCE is the primary attack element of a 
MAGTF and is expected to have a higher rate of consumption. An 
expected input for this field is an integer or floating point (decimal) 
number. For elements intended for GCE-use only, the “OTHER-THAN 
GCE RATES” column may be empty. This column is broken down further 
into three sub-columns: 

• Daily Assault. This rate is shown as the number of rounds per day 
per weapon or individual in the GCE during the assault (intense) 
phase of combat [2]. 

• Daily Sustain. This rate is shown as the number of rounds per day 
per weapon or individual in the GCE during the sustainment phase 
of combat [2]. 

• Basic Allowance. This rate indicates the basic allowance (BA) of 
the ammunition item recommended to be carried within the means 
normally available to the Fleet Marine Force (FMF) unit 
embarking and debarking for combat operations [2]. 

• Other than GCE Rates. This column is used to define the consumption
rate of the Command Element (CE), Aviation Combat Element (ACE),
and Combat Service Support Element (CSSE) of the MAGTF. Overall,
this column has the same characteristics of the “GCE RATES” column: 
daily assault, daily sustain, basic allowance, and the use of integer or 
floating point numbers. For items intended solely for the CE, ACE, or 
CSSE, the “GCE RATES” column may be empty. 
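
The column layout described above can be sketched as a small Python record. This is only an illustration: the class and field names are hypothetical, and the sample values are placeholders rather than figures taken from [2].

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rates:
    """One rate group (GCE or other-than-GCE); field names are illustrative."""
    daily_assault: Optional[float] = None
    daily_sustain: Optional[float] = None
    basic_allowance: Optional[float] = None

    def is_empty(self) -> bool:
        # A rate group is empty when none of its three sub-columns is filled.
        return (self.daily_assault is None
                and self.daily_sustain is None
                and self.basic_allowance is None)


@dataclass
class ConsumptionDataElement:
    """One row of a usage table: weapon, ammunition, and two rate groups."""
    weapon_id: str
    weapon_nomenclature: str
    dodic: str
    ammo_nomenclature: str
    gce: Rates
    other_than_gce: Rates


# Placeholder values, not taken from any reference document.
element = ConsumptionDataElement(
    weapon_id="W001",
    weapon_nomenclature="EXAMPLE WEAPON",
    dodic="X000",
    ammo_nomenclature="EXAMPLE CARTRIDGE",
    gce=Rates(daily_assault=12.5, daily_sustain=4.0, basic_allowance=210),
    other_than_gce=Rates(),  # left empty: item intended for GCE use only
)
```

Keeping the two rate groups as separate objects mirrors the table, where either group may be empty for a given element.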

While not every usage table published by the United States Marine Corps is
an exact replica (data and content) of the one shown in Figure 2, it is reasonable to expect
that each of them will follow a similar table layout or structured format. Figure
3, for example, represents the same information from Figure 2 as a table from a different
reference document.

Figure 3. Class V(W) FY-06 Planning Factors Consumption Data (from [3])


c. Analyze 

Having found the correct reference documents and usage tables, the next
challenge faced by the planner is to analyze the usage tables, extracting specific
consumption data elements as necessary. In order to plan for an operation to field
100 Marines, for example, they may need to gather 20 consumption data elements from an
electronic usage table and 100 consumption data elements from a different hardcopy 
usage table. In order to keep track of these elements for planning an operation, the most 
realistic approach is to record them into a single location in a universal format. This is 
done primarily via electronic means—Microsoft Word®, Microsoft Excel®, text files, 
etc. Thus, even during the analysis phase, they must juggle between electronic and
hardcopy formats. To complicate matters, the logistician may be stateside or deployed, 
may or may not have access to the Internet, may or may not have hardcopy usage tables 
for ready reference, and may not have an extensive planning shop at his/her disposal. 
Also, the user must have some familiarity with usage tables and a working knowledge of 
which usage tables contain specific data elements. Since hardcopy usage tables do not 
have search functionality, inexperienced planning personnel may spend countless hours 
reading through a usage table document only to find that they had the wrong document. A 
clear benefit of having electronic usage tables, in the form of a PDF for example, is that 
the user gains the ability to search through the document using partial or full keyword
searches.

d. Utilize 

After the correct information has been located and compiled, military planners 
and logisticians provide that data to their military commander for planning purposes or 
use the data to accomplish their main task—equipping and maintaining troops. Thus, 
depending on the final document, the data may be displayed in different
ways: databases, spreadsheets, pie charts, etc. While different storage methods are
discussed for storing the output of the extraction programs, this stage of the process is 
outside the focus of this thesis. 

2. Present-day Solutions 

While no systems have been created to address this specific conundrum, several 
systems have been created to aid in the planning process. Systems such as the Joint 
Operations Planning and Execution System and The Marine Air Ground Task Force War 
Planning System have been used as resources [3]. Some end-users have taken a proactive 
approach, creating stand-alone systems. Using programs such as Microsoft Excel® and 
other user-created applications, these users have attempted to provide temporary 
solutions to the current problem as depicted in Figures 4 and 5. 
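
The arithmetic behind the worksheet in Figure 4 can be sketched in a few lines of Python. This is an inference from the values visible in the figure, not a documented formula: the total daily requirement appears to be the unit daily requirement multiplied by the unit multiple, and the subtotal appears to be each total multiplied by the number of days in that phase. The function and parameter names are hypothetical.

```python
def fuel_subtotal(unit_assault, unit_sustained, multiple,
                  days_assault, days_sustained):
    """Return (total_assault, total_sustained, subtotal) for one unit row."""
    total_assault = unit_assault * multiple
    total_sustained = unit_sustained * multiple
    subtotal = total_assault * days_assault + total_sustained * days_sustained
    return total_assault, total_sustained, subtotal


# Row values as they appear in Figure 4 (H&S Co, N4606, and Radio Co, N4635).
hs_co = fuel_subtotal(4523, 2706, 1, days_assault=7, days_sustained=10)
radio_co = fuel_subtotal(1270, 697, 2, days_assault=0, days_sustained=1)

print(hs_co)     # (4523, 2706, 58721)
print(radio_co)  # (2540, 1394, 1394)
```

Both results agree with the subtotals shown for those rows in the figure, which supports the inferred formula without confirming it.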



[Figure 4 image not reproduced. The worksheet lists, for each unit type and T/E number,
the unit daily fuel requirement (assault and sustained), the normal unit multiple, the total
daily fuel requirement (assault and sustained), the actual unit multiple, the number of days
in the assault and sustained phases, and a subtotal. For example, the H&S Co row (N4606)
shows 4,523 and 2,706 per day, a multiple of 1, 7 and 10 days, and a subtotal of 58,721.]

Figure 4. Fuel Planning Worksheet in Microsoft Excel® (from [3])

Figure 5. Spreadsheet utilized by II Marine Expeditionary Force (from [3])

Although these user-created applications are made with good intentions, most 
suffer from the same deficiencies: 

• Not accepted as legitimate applications by the U.S. Marine Corps. 

• May contain volatile or malicious code susceptible to attack. 

• Not widely-distributable or maintainable. 

• May contain invalid and/or obsolete information. 

While these solutions are innovative and have some merit, a more stable solution 
that can be accredited, upgraded, and distributed to all users in the Marine Corps in a 
variety of environments is necessary. 

Another simple and inexpensive solution for today’s environment is the use of 
historical information. When constrained by time or resources, planners may rely on data 
from previous operations or exercises to plan for an operation. Since many operations 
may be similar in nature, planning for a new operation may be as simple as changing the 
total troop count or total vehicle count. This methodology has two main drawbacks: a 
lack of flexibility and the potential to be trusted as the definitive data without further 
examination. With the changing environment of the world, new missions may be 
encountered which lack historical examples. Beyond this lack of flexibility, repetitive
use of historical data can erode the core competencies of planners and may provide
estimates that are clearly inappropriate for the given scenario based on the different
factors surrounding the mission.

From a “cradle-to-grave” standpoint, the current methodology of locating,
retrieving, analyzing, and utilizing consumption data is a tedious and laborious task. 
Depending on the type and amount of operations being conducted, this compilation 
process may become exponentially time-consuming and complicated. Further aggravated 
by personnel cuts, this process places an undue burden on the planner. 

B. PURPOSE OF STUDY 

This thesis strives to reduce the burden placed on the planner by answering three 
main research questions: 

• “What are the abilities and limitations of current OCR technologies?” 

• “What is the best method for analyzing consumption documents? 
Automated analysis with review at the end or walkthrough analysis with 
review throughout the entire process?” 

Figure 6 illustrates how answering these questions relates to refining the current 
process. 






Figure 6. Consumption Data Correlation Process (refined) 


The first question addresses the problem of storing both electronic and hardcopy
documents. With consumption data appearing in both electronic and hardcopy
documents, planners must spend considerable time collecting and transcribing this data
into one central location. Not only is this time-consuming and labor-intensive, critical
consumption data may be omitted or incorrectly interpreted. While conducting OCR can
appear labor-intensive at first glance, the use of an OCR application can reduce the
administrative burden placed on the planner by (a) allowing them to convert hardcopy
documents into electronic documents that can be searched and (b) allowing them to
convert pre-existing electronic documents that cannot be searched into searchable
documents. These searchable documents can also be used as inputs to data extraction
programs that extract consumption data elements and provide them in
a useful output, for example, a (key, value) pair for a database or a simple text file
containing solely consumption data elements. In order to provide an accurate snapshot of
OCR, three off-the-shelf OCR applications were compared: an open-source, freely
available, online application; a licensed version of Microsoft OneNote®; and a licensed
version of Nuance OmniPage®. The accuracy rate, capabilities, and limitations of each
application were tested and are recorded in Chapter IV. As a byproduct of this comparison, a
standard text file was created that was later used by the analysis programs. This text file
was reviewed and corrected until its contents were 100% accurate, creating “perfect
inputs” for the analysis programs.

To answer the second question, two simple analysis programs were created and
compared. These programs used the text file outputs created by the OCR applications as
inputs. The goal of the first program was to maximize the work done by the program and
reduce the amount of user interaction necessary. While the program analyzes the input
files using pre-defined logic statements, there is still a requirement for the end-user to
review and verify the entries at the end. The goal of the second program was to compare
an alternative approach, requiring continual user interaction throughout the
decision-making and review process. This program was created in two versions: line-by-line and
page-by-page. The byproduct of this comparison was the creation of a standard text file
and a (key, value) pair in the form of a Python list data structure. These outputs represent
the last step of the study. Determining the best data storage and presentation approach for
these outputs is left for future research. 
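
As a rough illustration of the first (automated) approach, the sketch below parses lines of OCR text into (key, value) pairs stored in a Python list and sets aside any line that does not match for user review at the end. The line layout, the regular expression, and the function name are assumptions made for this example; they are not the programs developed in this thesis.

```python
import re

# Assumed layout: an alphanumeric identifier, a nomenclature, and a numeric rate.
LINE_PATTERN = re.compile(
    r"^(?P<key>[A-Z0-9]+)\s+(?P<name>.+?)\s+(?P<rate>\d+(?:\.\d+)?)$"
)


def extract(lines):
    """Return (pairs, needs_review) from an iterable of OCR text lines."""
    pairs, needs_review = [], []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        match = LINE_PATTERN.match(text)
        if match:
            value = (match.group("name"), float(match.group("rate")))
            pairs.append((match.group("key"), value))
        elif text:
            # Non-empty line the pattern could not interpret: keep for review.
            needs_review.append((number, text))
    return pairs, needs_review


sample = [
    "X000 EXAMPLE CARTRIDGE 12.5",
    "X001 EXAMPLE GRENADE 1O.2",   # letter O misread in place of a zero
    "",
]
pairs, needs_review = extract(sample)
```

Here the first sample line becomes a (key, value) pair, the second is held back with its line number, and the blank line is ignored. A walkthrough variant would instead prompt the user at each line or page.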

C. PRO/CON ANALYSIS OF AUTOMATION


On one hand, digitizing and storing consumption data has the following benefits: 

• Reduction in manpower hours necessary to correlate data. 

• Allows access to consumption data from anywhere in the world. 

• The ability to provide consumption data in one location. 

• Eliminates the need to keep documents in hardcopy form, providing cost 
savings and reducing environmental usage. 

• The ability to present the consumption data in various electronic 
formats—database, charts, tables, text documents, etc. 

On the other hand, there are drawbacks: 

• Higher cost in database/application administration (personnel and
equipment).

• Requires secondary investment in security and network monitoring 
personnel and systems. 

• Depending on the extensiveness of the system, it may require dedicated 
personnel to operate and maintain the system. 

• Relying solely on the software may, as in the case of using historical data, 
erode the core competency of the end user. 

• Data corruption and/or data loss could result in a lack of availability for 
the system or data. 

• Should this process be completely digitized and the printing of hardcopy 
consumption documents be ceased, an end-user without access to the 
system would be unable to accomplish their tasks. 

While these benefits and drawbacks may not be all-encompassing, the progression
towards a computer-aided system is a natural one and is detailed in the following
chapters.

D. ORGANIZATION OF THESIS 

Chapter II provides a background study, presenting two forms of OCR, free-form
and template-based, and discussing several options for storing the output from the
analysis programs. Chapter III discusses two approaches for analyzing electronic
consumption documents: automated and walkthrough. Chapter IV compares the OCR
applications mentioned in Chapter II and demonstrates the analysis programs presented in
Chapter III. Chapter V summarizes the results of Chapter IV, lists recommendations
reached as a result of this research, and identifies areas for continued research.

II. BACKGROUND STUDY


A. WHAT IS OPTICAL CHARACTER RECOGNITION? 

This field of study was summarized in 1993 by Line Eikvil, a Norwegian scientist 
who specialized in pattern recognition, computer vision, and text mining as follows: 

Optical Character Recognition deals with the problem of recognizing 
optically processed characters. Optical recognition is performed off-line 
after the writing or printing has been completed as opposed to on-line 
recognition where the computer recognizes the characters as they are 
drawn. Both hand printed and printed characters may be recognized but 
the performance is directly dependent upon the quality of the input 
documents. [6] 

Although there has been refinement and improvement in OCR technology, this 
summary still represents the fundamental principles behind the process. We are primarily 
concerned with (a) performing optical recognition and (b) ensuring high-quality input
documents are supplied to the process. The latter is a universally expected
norm: for any application to provide the best outputs, it must be given the best
inputs. As a means to an end, consumption data must exist on a computer and be
recognized by the analysis program. In order to do this for consumption documents, OCR 
was leveraged to handle two cases: 

• Consumption data contained in hardcopy documents. Here, the 
document exists solely in hardcopy format and OCR must be conducted 
on the document to create an electronic version. 

• Consumption data contained in electronic documents but not 
recognized by applications as a machine-readable format. Here, the 
document exists in an electronic version but the document (or parts of it) 
may be presented as data that an application cannot interpret. For example,
the Python programming language is unable to natively interpret 
Microsoft Word .docx extensions or data contained in PDF files. 
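
The second case can be illustrated with a short Python sketch. PDF files begin with the bytes `%PDF`, and .docx files are ZIP archives that begin with `PK`; reading either one as plain text therefore yields container data rather than the words of the document. The function name and the returned labels are hypothetical.

```python
def sniff_format(first_bytes: bytes) -> str:
    """Guess a container format from the leading bytes of a file."""
    if first_bytes.startswith(b"%PDF"):
        return "pdf"
    if first_bytes.startswith(b"PK\x03\x04"):
        return "zip-based (e.g., .docx)"
    try:
        first_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary"
    return "plain text"


print(sniff_format(b"%PDF-1.4\n"))           # pdf
print(sniff_format(b"PK\x03\x04\x14\x00"))   # zip-based (e.g., .docx)
print(sniff_format(b"Daily Assault 12.5"))   # plain text
```

In practice the leading bytes would come from opening the file in binary mode, for example `open(path, "rb").read(8)`. Only the plain-text case can be handed directly to a line-oriented analysis program.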

1. OCR Software Approaches 

In order to best understand the nature of OCR, we conducted OCR on a usage 
table and present the results as an example in Figure 7. Note: this data was originally 
presented in a vertical landscape view. For the purpose of the example, it was manually
converted to a horizontal profile view. Chapter IV addresses the ability and accuracy of
OCR applications to conduct this procedure automatically.

Figure 7. Usage Table—Pre-OCR (from [2])


Using an OCR application, this table was processed and a Microsoft Word® 
document was created. Figure 8 represents this document with errors highlighted in 
yellow.

Figure 8. Usage Table—Post-OCR (after [2])


In order to create the output in Figure 8, the OCR application went line-by-line 
through the input document, recognizing, interpreting, and transcribing characters as it 
went along. The output illustrates that the majority of the data was transcribed correctly. 
Most of the errors that occurred were related to the numerical values associated with the
DODIC and the various rate values on the right-hand side. In particular, longer numerical
values that contained decimals had the highest error rate. Of note,
the OCR application was intelligent enough to place the output data into a table. This
kind of intelligence helps to refine and preserve the data for later analysis. Although the
application generated quite a few errors, this may not be an issue for a user who is using
OCR as a simple text conversion tool. However, this error rate is cause for concern in
applications that must maintain accurate information. Before discussing the manner in
which this error rate can be reduced, it is important to understand that two forms of OCR
exist: free-form and template-based. 
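
Because most of the errors fell in numeric fields, a simple post-OCR check can flag values that do not parse as numbers. The following is a minimal sketch in Python; the function name and the sample strings are made up for illustration. A check like this catches only malformed numbers, not digits misread as other digits.

```python
def suspicious_numbers(fields):
    """Return the fields that fail to parse as a decimal number."""
    flagged = []
    for text in fields:
        try:
            # Thousands separators are removed before parsing.
            float(text.replace(",", ""))
        except ValueError:
            flagged.append(text)
    return flagged


# Invented OCR output: two entries contain letters in place of digits.
ocr_fields = ["12.5", "1,394", "1O.2", "4.l7", "0.003"]
print(suspicious_numbers(ocr_fields))  # ['1O.2', '4.l7']
```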

a. Free-form OCR 

The output produced in Figure 8 represents the use of free-form OCR. This is the
most common and default method for most OCR applications and the method used by
this thesis. While it allows for maximum flexibility of input formats such as text, tables,
images, and alphanumeric characters, it is “considered slow and inaccurate at times...
however, using free-form will still significantly reduce the amount of errors due to
mis-keying during manual data entry” [7]. Although free-form was used in this thesis, should
consumption data present itself in a predictable fashion in the future, another form of
OCR may provide a higher degree of accuracy and throughput: template-based OCR.

b. Template-based OCR 

Many institutions and corporations throughout the world use template-based OCR 
to conduct data entry for a variety of systems. One such data entry method has gained 
popularity in the last decade: mobile banking deposit. Many major banking institutions
allow members to directly deposit paychecks to their accounts using mobile-banking
applications. The only requirement is for the end user to have an end-device capable of
running the application and creating or importing an image (photograph) of the check to
be processed. Again, this method is best understood through an example.
Figure 9 is a blank check for illustrative and discussion purposes.

Figure 9. Blank Check for Template-based OCR Discussion

In computing, a template is defined as “a computer document that has the basic
format of something (a business letter, chart, graph, etc.) that can be used many different
times” [8]. A check follows this definition, having features that are pre-defined and used
many different times in almost all other checks. With regard to understanding
template-based OCR of a check, a check has the following distinguishable features:

• Rectangular in shape, having four corners at 90-degree angles.

• Standard fields to denote data fields (e.g., DATE, PAY TO THE ORDER 
OF, $, DOLLARS, and FOR). 

• Magnetic ink character recognition (MICR) font information at the bottom
of the check (e.g., Bank Routing Number, Bank Account Number, and
Check Number). The MICR font is a standard of the American National
Standards Institute (ANSI) and was specifically created for recognition on 
checks. 

While financial institutions withhold their proprietary software procedures and
capabilities, with an understanding of how template-based OCR operates, their processes
can be demystified to provide an understanding of how this technology can be used to
interpret consumption documents. First, the check to be processed must be filled out with
all the necessary fields completed. Note: some companies have software capable of
detecting when fields are not completed and provide error responses. Second, the check
must be entered into the system by either taking a picture of it or scanning it using an
input device. Some applications may direct the input of the check from within the
application (e.g., clicking deposit check prompts the user to take a picture of the front and
the back of the check). Once the check has been placed in an electronic format, 
depending on the sophistication of the application, it may be processed at the end-device 
or sent to a data processing system to verify the accuracy of the check. This is the phase 
where template-based OCR processing occurs. During this processing, the following 
questions are answered: 

• Is the document a check? This can be done by verifying the shape and 
length of the document as well as verifying the 90-degree angles are 
present. Some companies may even have the ability to detect when the
corner of a check is torn; however, such information is simply not known
due to the proprietary nature of the software. If the document is not a
check, an error should be returned. 

• Are all the fields filled out? This is done by observing the marks that 
occur within the standard data fields. For example, a valid date should be 
after the DATE label and a valid numerical value should be after the dollar 
sign ($). If “$ ILUVCOOKIES” was written on the check, the software 
should be intelligent enough to understand that ILUVCOOKIES is not a 
valid numerical value and return an error. 

• Is the accounting information at the bottom valid? This is done by 
interpreting the numerical values in MICR font at the bottom. Comparison 
of the name on the check to the account holder information would be a 
likely check for authenticity. If the accounting information is not correct, 
an error should be returned. 
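
The field checks described above can be sketched as follows. This is a toy illustration in Python, not how any bank's software works: the dictionary keys, the MM/DD/YYYY date format, and the function name are all assumptions.

```python
from datetime import datetime


def validate_check_fields(fields):
    """Return a list of error strings; an empty list means the fields pass."""
    errors = []
    # A valid date is expected after the DATE label (format assumed).
    try:
        datetime.strptime(fields.get("DATE", ""), "%m/%d/%Y")
    except ValueError:
        errors.append("DATE is missing or not a valid date")
    # A valid numerical value is expected after the dollar sign.
    amount = fields.get("$", "").replace(",", "")
    try:
        if float(amount) <= 0:
            errors.append("$ must be a positive amount")
    except ValueError:
        errors.append("$ is missing or not a numerical value")
    if not fields.get("PAY TO THE ORDER OF", "").strip():
        errors.append("PAY TO THE ORDER OF is empty")
    return errors


good = {"DATE": "03/14/2014", "$": "125.00", "PAY TO THE ORDER OF": "J. Doe"}
bad = {"DATE": "03/14/2014", "$": "ILUVCOOKIES", "PAY TO THE ORDER OF": "J. Doe"}

print(validate_check_fields(good))  # []
print(validate_check_fields(bad))   # ['$ is missing or not a numerical value']
```

Each check maps to one labeled field of the template, which is what makes this style of validation possible: the software knows in advance where each value should appear and what type it should be.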

Finally, based on the results of the processing, the user should be given an 
acceptance or rejection status of the overall transaction. Figure 10 illustrates this process 
from beginning to end.

Figure 10. Template-based OCR Process of a Check


Throughout the process, the accuracy of the input (the check) is verified against a 
template (features of a check). Should the check deviate from the accepted template of a 
check or contain erroneous attributes, the software should return an error. The
important takeaway in this scenario is the need for template compliance, accuracy, and 
readability. In order for template-based OCR to interpret consumption data, the data 
should present itself repetitively and in the form of a template such as a table. Based on 
the consumption documents reviewed by this thesis, consumption data is a conglomerate 
of free text, tables, and lists, advocating the use of free-form OCR. 

2. Increasing OCR Accuracy 

Having discussed the various forms of OCR, we are able to return to the 
discussion of improving input accuracy with an appreciation for its importance. The 
usage of either free-form or template-based OCR is based on the input supplied to the 
OCR application. The effectiveness and accuracy of the application, independent of the
OCR format chosen, is based upon the clarity and readability of the given input. In
computer science, the acronym GIGO (garbage in, garbage out) is often used.
Should the application for analyzing and interpreting consumption data receive bad
inputs, it will most likely produce bad outputs. Thus, prior to converting consumption
data using OCR, the highest emphasis must be placed upon ensuring the “cleanliness” of
the input documents. In order to do so, the following steps should be taken:

(1) Paper-to-electronic conversion. Should a consumption document exist 
solely in hardcopy form, the best known copy should be used for OCR conversion. This
document should have minimal interference and degradation. For example, if pages are
torn, they should be replaced. Smudges, ink blots, and extra markings inside the
document should be removed. OCR may attempt to recognize these markings and
produce erroneous results. Note: photocopied documents tend to lose quality through
blurring and fuzzing and should be used as a last resort.

(2) Electronic-to-electronic conversion. Should a consumption document 
exist solely in electronic form, the best known copy should be used as the input
document. Again, the document should be inspected with the same care as in the
paper-to-electronic conversion. Should the electronic document be of poor quality, another
document should be used or the current one rewritten. 

(3) Document formatting. For optimal processing, consumption data should 
be displayed in the appropriate format (text, table, columns, rows, etc.). While OCR will
attempt to identify and render this data intelligently, giving it the desired input for an
expected result is recommended. For example, if the OCR application is expected to
create a table of data, it should be given a table of data. When possible, all pages should be
presented in the same page layout. For example, all pages in the document should be
presented in either portrait or landscape format, not a mixture of the two.

Although some of these steps may be manpower-intensive up front, the dividends 
they pay in the long run may outweigh the costs incurred with auditing the OCR output. 
Reformatting and retyping a document may take several hours or even days. Likewise, 
the same amount of time may be spent auditing and correcting the OCR output if the 
OCR software misinterprets inputs and provides unreadable and/or gibberish outputs. For 
example, the software may present 100 lines of data in portrait view when the original 
input was 3 lines of data presented in landscape view. Thus, it is incumbent upon the 
person who collects the electronic and hardcopy consumption documents for OCR to 
make the decision as to whether or not the document is in an acceptable state. 

3. OCR Summary 

OCR comes in two forms: free-form and template-based. Due to the nature of the 
consumption documents reviewed by this thesis, free-form OCR is leveraged. Emphasis 
is placed upon refining the input documents prior to OCR. In order to leverage OCR to its 
maximum effectiveness, it should be presented with the best inputs. The same mentality 
can be applied to the process of making wine: poor-quality grapes can seldom be 
mitigated by the winemaker. 

B. METHODS FOR STORING CONSUMPTION DATA 

The analysis programs compared in Chapter IV produce two outputs: a standard 
text file and a (key, value) pair. To illustrate the rationale behind this 
decision, the following data storage methods are discussed: traditional databases and a 
document repository. 

1. Traditional Database 

A database can be leveraged to store consumption data. Two types of databases 
currently exist: relational and non-relational. 

a. Relational Database Model 

Inside a relational database, data is represented in a schema (a framework) 
consisting of tables that have interconnecting relationships. Data may be spread across as 
few as one table or as many as thousands in order to correctly represent entities and 
relationships between the data elements. Each element (entity) in a database table has a 
unique identifier, referred to as the primary key, which prevents duplicate information 
from existing. Multiple entities are mapped together using relationship tables. In order to 
make this data available to the end user, a database server is created and hosted, allowing 
users to query the database. In order to query the database server, a user will typically 
build queries using structured query language (SQL) constructs. Figure 11 illustrates how 
one consumption data element may be represented in a relational database. Figure 12 
illustrates the query submitted and the query’s result based on the data shown in 
Figure 11. 


Weapon Table (Entity)

Weapon ID    Nomenclature
B0471        Squad Demolition Set


Ammunition Table (Entity)

DODIC    Nomenclature
M032     Charge, Demo Block 1 LB TNT


GCE Rate Table (Entity)

DODIC    Daily ASSAULT    Daily SUSTAIN    Basic Allowance
M032     15.00753         3.39605          48


Other than GCE Rates Table (Entity)

DODIC    Daily ASSAULT    Daily SUSTAIN    Basic Allowance
M032     15.00753         3.39605          15


Part of Table (Relationship mapping between Weapon and Ammunition)

Weapon ID    DODIC
M032         M032


Rates for Table (Relationship mapping between Ammunition and Rates)

DODIC    Daily ASSAULT    Daily SUSTAIN    Basic Allowance    Daily ASSAULT    Daily SUSTAIN    Basic Allowance
M032     15.00753         3.39605          48                 15.00753         3.39605          15


Figure 11. Consumption Data Element in Relational Form 


select *
from Weapon, Ammunition, GCE Rate, Other than GCE Rates, Part of, Rates for
where Weapon.WeaponID = “B0471” and Ammunition.DODIC = “M032”


Query result:

Weapon ID    Nomenclature            DODIC    Nomenclature
B0471        SQUAD DEMOLITION SET    M032     CHARGE, DEMO BLOCK 1 LB TNT
B0471        SQUAD DEMOLITION SET    M130     CAP, BLASTING ELECTRIC

Query result (columns continued, rows in the same order):

Daily Assault    Daily Sustain    Basic Allowance    Daily Assault    Daily Sustain    Basic Allowance
15.00753         3.39605          48                 15.00753         3.39605          15
18.01945         4.00000          150                18.01945         2.03084          150


Figure 12. Relational Database Query and Result 


Figure 11 illustrates the main drawback of implementing a relational database. 
Although only one consumption data element was given, six tables had to be created in 
order to correctly represent all of its data elements. While this may seem insignificant at 
first glance, the problem becomes more pronounced when data is spread across hundreds 
or thousands of tables. Here, a cost in computing time is incurred to search through each 
table, establish relationships, and present subsequent tables. Additionally, any inputs into 
the system must strictly adhere to the existing format, or “schema.” Any inputs that 
deviate from the appropriate input format will be rejected or cause an error. 
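
As a minimal sketch of the relational approach, the following Python fragment uses the standard-library sqlite3 module to build four of the six tables from Figure 11 and resolve their relationships in one query. The lowercase table and column names are assumptions made for illustration, and SQLite merely stands in for whatever database server would actually be hosted.

```python
import sqlite3

# In-memory SQLite database; all table and column names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE weapon     (weapon_id TEXT PRIMARY KEY, nomenclature TEXT);
    CREATE TABLE ammunition (dodic TEXT PRIMARY KEY, nomenclature TEXT);
    CREATE TABLE part_of    (weapon_id TEXT, dodic TEXT);
    CREATE TABLE gce_rate   (dodic TEXT PRIMARY KEY, daily_assault REAL,
                             daily_sustain REAL, basic_allowance INTEGER);
""")
conn.execute("INSERT INTO weapon VALUES (?, ?)", ("B0471", "SQUAD DEMOLITION SET"))
conn.execute("INSERT INTO ammunition VALUES (?, ?)",
             ("M032", "CHARGE, DEMO BLOCK 1 LB TNT"))
conn.execute("INSERT INTO part_of VALUES (?, ?)", ("B0471", "M032"))
conn.execute("INSERT INTO gce_rate VALUES (?, ?, ?, ?)",
             ("M032", 15.00753, 3.39605, 48))

# Each relationship is resolved with an explicit join at query time.
rows = conn.execute("""
    SELECT w.weapon_id, w.nomenclature, a.dodic, a.nomenclature,
           g.daily_assault, g.daily_sustain, g.basic_allowance
    FROM weapon AS w
    JOIN part_of    AS p ON p.weapon_id = w.weapon_id
    JOIN ammunition AS a ON a.dodic = p.dodic
    JOIN gce_rate   AS g ON g.dodic = a.dodic
    WHERE w.weapon_id = ?
""", ("B0471",)).fetchall()

# The primary key rejects a second row for an existing entity.
try:
    conn.execute("INSERT INTO weapon VALUES (?, ?)", ("B0471", "DUPLICATE"))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

for row in rows:
    print(row)
conn.close()
```

Even in this reduced form, retrieving one weapon's rates takes three joins, which is the cost the paragraph above describes.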

b. Non-relational Database Model 

The main goal of using NoSQL (“Not Only SQL”) is to break away from the 
problems associated with maintaining relationships in relational databases. Since a 
relational database must keep track of the relationships contained within the database, it 
must create extra tables in order to do so. NoSQL implementations avoid this problem 
by offering a variety of ways to store data. One common approach is the use of a 
(key, value) pair to represent a single unique data element. This 
(key, value) approach illustrates the use of a fundamental computer science data 
structure: the dictionary. The National Institute of Standards and Technology (NIST) 
defines a dictionary as “an abstract data type storing items, or values. A value is accessed 
by an associated key. Basic operations for manipulating a dictionary are new, insert, find 
and delete” [5]. Figure 13 illustrates how one consumption data element may be 
represented in dictionary format. Figure 14 illustrates the query submitted in 
Java and the result that would be returned using the data shown in Figure 13. 


Weapon Dictionary

Weapon ID (key)    Consumption Information (value)
B0471              SQUAD DEMO SET, M032, CHARGE, DEMO BLOCK 1 LB TNT,
                   15.00753, 3.39605, 48, 15.00753, 3.39605, 15, M130, CAP,
                   BLASTING ELECTRIC, 18.01945, 4.00000, 150, 18.01945,
                   2.03084, 150, ...


Figure 13. NoSQL (key, value) Dictionary Example 


Program Code

import java.util.HashMap;

public class WeaponDictionary {
    public static void main(String[] args) {
        HashMap<String, String> map = new HashMap<>();

        // This is where values are entered into the dictionary
        map.put("B0471", "SQUAD DEMO SET, M032, CHARGE, DEMO BLOCK 1 LB TNT, "
                + "15.00753, 3.39605, 48, 15.00753, 3.39605, 15, M130, CAP, "
                + "BLASTING ELECTRIC, 18.01945, 4.00000, 150, 18.01945, 2.03084, 150, ...");

        String queryResult = map.get("B0471"); // This is where values are retrieved

        // ... parsing and string manipulation code not shown for simplicity. In a
        // real-world application, parsing logic would require CPU resources of the
        // end device requesting information from the server.
    }
}

Program Output

// While one form of potential output is shown, there are unlimited possibilities of
// representing this data to the end user.

Search results for “B0471” are:

WeaponID: B0471
Nomenclature: SQUAD DEMO SET
Ammunition:
    Subcomponent 1:
        DODIC: M032
        Nomenclature: CHARGE, DEMO BLOCK 1 LB TNT
        GCE Daily Assault: 15.00753
        GCE Daily Sustain: 3.39605
        GCE Basic Allowance: 48
        Other than GCE Daily Assault: 15.00753
        Other than GCE Daily Sustain: 3.39605
        Other than GCE Daily Basic Allowance: 15
    Subcomponent 2:
    Subcomponent 3:

Figure 14. Java-implemented Dictionary and Query: Result 
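
The lookup in Figure 14 can also be sketched in Python, the language used for the analysis programs later in this thesis. The nested layout of the value below is an assumption made for illustration, not the stored format shown in Figure 13; it is chosen because a flat comma-separated string is hard to split when a nomenclature such as “CHARGE, DEMO BLOCK 1 LB TNT” itself contains commas.

```python
# A Python dict plays the role of the HashMap in Figure 14. The nested field
# names are hypothetical; only the data values come from Figure 13.
weapon_dictionary = {
    "B0471": {
        "nomenclature": "SQUAD DEMO SET",
        "ammunition": [
            {
                "dodic": "M032",
                "nomenclature": "CHARGE, DEMO BLOCK 1 LB TNT",
                "gce": (15.00753, 3.39605, 48),
                "other_than_gce": (15.00753, 3.39605, 15),
            },
            {
                "dodic": "M130",
                "nomenclature": "CAP, BLASTING ELECTRIC",
                "gce": (18.01945, 4.00000, 150),
                "other_than_gce": (18.01945, 2.03084, 150),
            },
        ],
    }
}


def find(weapon_id):
    """Return the record stored under weapon_id, or None if the key is absent."""
    return weapon_dictionary.get(weapon_id)


record = find("B0471")
print("WeaponID:", "B0471")
print("Nomenclature:", record["nomenclature"])
for number, item in enumerate(record["ammunition"], start=1):
    assault, sustain, allowance = item["gce"]
    print("Subcomponent", number, item["dodic"], item["nomenclature"])
    print("  GCE Daily Assault:", assault)
    print("  GCE Daily Sustain:", sustain)
    print("  GCE Basic Allowance:", allowance)
```

With structured values, the parsing step that Figure 14 leaves out reduces to plain field access, at the price of agreeing on field names in advance.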


While a (key, value) implementation is discussed and used as the output format of 
the analysis program in this thesis, JavaScript Object Notation (JSON) and Extensible 
Markup Language (XML) formats also exist. 

c. Database Architecture 

Figure 15 illustrates how a database can be implemented. Using the illustration as 
a guide, we further discuss each component of the database infrastructure in detail. 



[Diagram labels: Secondary; Primary; Secondary; MILCOM Satellite; Secondary/Stand-Alone]


Figure 15. Global Database Architecture 


The following components are necessary to implement a database: 

• Primary Server: The primary server is responsible for providing each 
secondary server the most valid and updated information. 

• Secondary Server(s): To distribute the processing load off the primary 
server, secondary servers are created to handle transactions. Commonly, 
these servers are implemented in various geographic locations to not only 
distribute the load, but also provide faster responses to the end user. 

• Secondary/Stand-Alone Server(s): Similar to the secondary server, a 
stand-alone server could be implemented at a forward operating base (FOB) or 
onboard a ship in an amphibious ready group (ARG). Figure 15 illustrates 
the use of a server in an ARG environment. Here, updates can be sent back 
and forth between the stand-alone database and the primary server when 
connectivity is available over a military communications (MILCOM) 
satellite or other means. During periods of non-connectivity, elements in 
the database may become obsolete. 

• Intercommunication Capability/MILCOM Satellite: Connections between 
the land-based primary and secondary servers could be made over a variety 
of media: direct link, satellite, dedicated line, etc. Communication 
between a database at a FOB or an ARG requires reach-back capability via 
satellite communications using either a MILCOM satellite or a commercial 
provider, such as INMARSAT. Such communication is essential to 
replicating changes from the primary database to each secondary 
implementation. 
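
The effect of intermittent connectivity on a stand-alone server can be sketched as follows. This is a simplified illustration, assuming each record carries a version counter and that only primary-to-secondary replication is modeled; the class, method, and record names are hypothetical, and a fielded database product would supply its own replication mechanism.

```python
class Server:
    """Holds records as {key: (version, value)}; a higher version is newer."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def put(self, key, value):
        """Store value under key, incrementing that key's version counter."""
        version = self.records.get(key, (0, None))[0] + 1
        self.records[key] = (version, value)

    def stale_keys(self, primary):
        """Keys whose local copy is missing or older than the primary's copy."""
        return sorted(
            key for key, (version, _) in primary.records.items()
            if self.records.get(key, (0, None))[0] < version
        )

    def sync_from(self, primary, link_up):
        """Copy newer records from the primary; do nothing without a link."""
        if not link_up:
            return 0
        stale = self.stale_keys(primary)
        for key in stale:
            self.records[key] = primary.records[key]
        return len(stale)


primary = Server("primary")
ship = Server("stand-alone (ARG)")
primary.put("M032", "rate table rev 1")
ship.sync_from(primary, link_up=True)            # initial load while connected

primary.put("M032", "rate table rev 2")          # update made while the ship is offline
print(ship.stale_keys(primary))                  # keys that are now obsolete on the ship
print(ship.sync_from(primary, link_up=False))    # no satellite link: nothing copied
print(ship.sync_from(primary, link_up=True))     # link restored: stale records refreshed
```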

Database locations require additional hardware and software: 

• Storage devices: required for storing the database software and the data 
contained in the database itself. 

• Network devices: provide connectivity in and out of the database. 

• Input devices: mouse, keyboard, scanners, etc. 

• Database software: necessary to run the database. 

Figure 16 illustrates the necessary components for a database. 



Figure 16. Database Location Hardware Instance 


2. Document Repository 


For the purpose of this discussion, we reference the Department of Defense 
(DOD) website that lists all valid DOD instructions [4]. Figure 17 displays a portion of 
this site for illustration and discussion. 


[Buttons: Copy to clipboard, CSV, Excel, PDF, Print]

ISSUANCE NUMBER    ISSUANCE DATE    ISSUANCE SUBJECT                              CHANGE #

DoDI 1000.01       4/16/2012        IDENTIFICATION (ID) CARDS REQUIRED BY
                                    THE GENEVA CONVENTIONS

DoDI 1000.04       9/13/2012        FEDERAL VOTING ASSISTANCE PROGRAM
                                    (FVAP)

DoDI 1000.11       1/16/2009        FINANCIAL INSTITUTIONS ON DOD
                                    INSTALLATIONS

DoDI 1000.13       1/23/2014        IDENTIFICATION (ID) CARDS FOR MEMBERS
                                    OF THE UNIFORMED SERVICES, THEIR
                                    DEPENDENTS, AND OTHER ELIGIBLE
                                    INDIVIDUALS

DoDI 1000.15       10/24/2008       PROCEDURES AND SUPPORT FOR NON-FEDERAL
                                    ENTITIES AUTHORIZED TO OPERATE ON DOD

Showing 1 to 714 of 714 entries 

Figure 17. DOD Instructions in Circulation (from [4]) 


This website gives the appearance of a database. The information is displayed in a 
table with columns. It has search and filter capability with full and partial-match 
functionality. The website is accessed via a compatible web browser such as Microsoft 
Internet Explorer, Mozilla Firefox, or Google Chrome. While this structure gives the 
appearance of a traditional relational or non-relational database, it has been created using 
JavaScript and HTML. When a hyperlink is clicked, the user is directed to the resource 
that is located in the web server’s directory. Thus, no “query: result” transaction is 
conducted. When the user clicks a hyperlink, an HTTP GET request is sent to the server. 
The web server then sends the requested material back to the user. In this example, a GET 
request returns a document that exists as a PDF in the server’s directory. This method 
differs from a traditional database in the way data is stored. Unlike a database, an entire 
file is stored in a directory that can be accessed. Users are able to search and retrieve by 
document name but do not have the ability to search the contents of all the files at once. 
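
The name-only search described above can be sketched as follows, assuming the repository is simply a directory of files. The directory, the file names, and the function name are hypothetical and are not taken from the DOD website.

```python
import tempfile
from pathlib import Path


def search_by_name(directory, term):
    """Return names of files whose file name contains term (case-insensitive).

    Only names are inspected; file contents are never opened, which mirrors
    the limitation of a plain document repository.
    """
    term = term.lower()
    return sorted(p.name for p in Path(directory).iterdir()
                  if p.is_file() and term in p.name.lower())


# Build a throwaway repository so the sketch is self-contained.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "DoDI_1000.01.pdf").write_bytes(b"geneva conventions")
    (Path(repo) / "DoDI_1000.04.pdf").write_bytes(b"voting assistance")
    (Path(repo) / "notes.txt").write_text("voting")

    by_number = search_by_name(repo, "1000.04")   # partial match on the name
    by_content = search_by_name(repo, "voting")   # text inside files is not searched

print(by_number)
print(by_content)
```

A term that appears only inside a document produces no match, because nothing in this arrangement indexes file contents.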

3. Data Storage Summary 

Traditional databases can be leveraged to store consumption data. Relational 
databases can maintain relationships between entities (“Weapon and Ammunition,” for 
example) but may become cumbersome with large quantities of data. Non-relational 
databases circumvent this problem by relaxing input types and eliminating the need to 
maintain relationships between data elements. However, non-relational databases are 
less mature, and few people have the skills to maintain them. Alternative methods for 
storing data exist. Storing consumption documents in text files on a web server is one 
such alternative. 


III. ANALYSIS PROGRAMS 


A. PROGRAM CREATION 

Once the consumption documents have been converted into an electronic format 
and collected in a centralized location by the planner, the process of analyzing the 
contents of these documents may begin. Currently, no programs exist that have the sole 
purpose of extracting consumption data elements. This is an important statement because 
it illustrates the infancy of such a program. Programs are often created by understanding 
and copying the core functionality and limitations of another similar program. Thus, we 
must look at this program from the ground up. 

1. Language Selection 

Every program in existence is created out of lines of code. These lines of 
code are written using a programming language. One such language, Python, is widely 
known and implemented and will be used as the language of choice for this 
thesis. It can be explained easily (compared to other languages) and avoids many 
restrictions that other languages must enforce. 

2. Design and Functionality Specification 

Figure 18 illustrates the flow of data in and out of the proposed extraction 
programs, providing a framework for discussion. 

Figure 18. Application Flow 


a. Data Import (1.1)—Open the Input File 

The first step the program must take is the opening of an input file. A number of 
input file types may exist depending on the OCR software used to create them. However, 
these file types must be recognized by the program. To prevent any problems, the 
programs presented in this thesis use a simple text file with a .txt 
extension. Figure 19 illustrates a simple file-open routine that can be used by the program 
to open a file and count the number of lines it has. This is an important step because it 
ensures the program can conduct basic operations on the input file. 

# Note: text following a pound sign (#) represents a
# comment and is not executable code.
# This program asks the end user to specify the name of an
# input document. Once the user has done so, the file is then
# opened and the number of lines is counted.

print("Please enter the file name of the input document:")
filename = input()               # Here the user specifies the input name.
filename = filename + ".txt"     # We append the .txt file extension.
file = open(filename, 'r')       # We open the file.
lineCount = 0
for line in file:
    lineCount = lineCount + 1    # And then count the number of lines.
print("Line count for " + filename + " is: " + str(lineCount))

--- OUTPUT ---

Please enter the file name of the input document:
consumptiondata
Line count for consumptiondata.txt is: 3

Figure 19. Open Input File Routine with Line Counting Logic 

This simple program first prompts the user for an input file name 
(“consumptiondata” in this example). Second, the program appends the .txt file extension. 
This step can be removed if the user is required to enter a file name ending with .txt. 
Third, the file is opened in read-only mode, specified by the “r” option. Other access 
options exist, namely the write option, which will be discussed in the “Export to File” 
section. Lastly, the program walks through the file and counts each line as it is 
encountered. This number is then printed to the screen. In this example, 
“consumptiondata.txt” contained three lines, which the program correctly counted. 
Note that for simplicity, and in light of the OCR conversion to be discussed later, this 
rudimentary program restricts the user to files with a “.txt” extension; however, a more 
complex implementation might present the user with a “chooser” window (“modal box”) 
that allows the user to select (“choose”) a file from a drop-down list of files contained 
within a given directory (“folder”) stored within the host system. 
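
A console-based stand-in for the “chooser” idea might look like the sketch below. It assumes the candidate files are the .txt files in a single directory and that the user picks one by menu number; the function names are hypothetical, and a graphical file dialog would replace the menu in a fuller implementation.

```python
import tempfile
from pathlib import Path


def list_text_files(directory):
    """Return the .txt files in directory, sorted by name."""
    return sorted(Path(directory).glob("*.txt"))


def count_lines(path):
    """Open path read-only and count its lines; the file closes automatically."""
    with open(path, "r") as handle:
        return sum(1 for _ in handle)


def choose_file(files, selection):
    """Map a 1-based menu selection to a file, or return None if out of range."""
    if 1 <= selection <= len(files):
        return files[selection - 1]
    return None


# Demonstration with a temporary directory standing in for the user's folder.
with tempfile.TemporaryDirectory() as folder:
    (Path(folder) / "consumptiondata.txt").write_text("line 1\nline 2\nline 3\n")
    (Path(folder) / "scan.pdf").write_bytes(b"%PDF")    # ignored: not a .txt file

    files = list_text_files(folder)
    for number, path in enumerate(files, start=1):
        print(number, path.name)
    chosen = choose_file(files, 1)       # in practice: int(input())
    chosen_name = chosen.name
    line_count = count_lines(chosen)

print("Line count for " + chosen_name + " is: " + str(line_count))
```

Offering only files that already exist removes the need to append an extension and avoids the error raised when a mistyped name cannot be opened.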

b. Read in Inputs and Store, Close Input File (1.2 and 1.3) 

Once the input file is open, data can be read out of the file and manipulated in the 
program’s memory space. Once the contents of the file have been read as input, the file 
can then be closed. Figure 20 illustrates a program that can handle these tasks. 

# This program reads in the inputs from the open file
# and places them into a list for further processing.
# Once the contents have been read in, the input file is
# closed and the contents of the list are printed.

file = open("consumptiondata.txt", 'r')   # Input file, opened as in Figure 19
consumptionData = []     # Python list data structure
listSize = 0             # Counters for program termination
i = 0
for line in file:
    consumptionData.append(line)    # Place each line in the list
    listSize = listSize + 1         # Keep track of the number of elements
print("Number of elements in the list is: " + str(listSize) + "\n")
file.close()             # Close the input file, no longer needed
while (i < listSize):
    print(consumptionData[i])       # Print out each line of input from
    i = i + 1                       # the secondary data structure

--- OUTPUT ---

Number of elements in the list is: 3

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT

B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

B0471 SQUAD DEMOLITION SET M131 CAP, BLASTING, NON-ELECTRIC

Figure 20. Read-in and Storage of Inputs 


This program begins by creating a Python list data structure named 
“consumptionData” with two program counters to control program execution and 
termination. Once these have been initialized, each line of the input file, 
“consumptiondata.txt,” is read into the list data structure. For tracking purposes, the size 
of the list is incremented as each new element is placed in the list. This allows us to keep 
track of the size of the list in the variable “listSize.” After all the data elements have been 
read, the file is closed and all the elements are printed to the screen. 
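
The same read-in step can be written more compactly. The sketch below is one possible variant, assuming the input file fits comfortably in memory; the function name is hypothetical, and a temporary file stands in for “consumptiondata.txt” so that the fragment runs on its own.

```python
import os
import tempfile


def read_lines(path):
    """Return the lines of a text file without their trailing newlines."""
    with open(path, "r") as handle:          # closed automatically on exit
        return [line.rstrip("\n") for line in handle]


# Write a small stand-in for consumptiondata.txt so the sketch runs anywhere.
sample = (
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT\n"
    "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC\n"
    "B0471 SQUAD DEMOLITION SET M131 CAP, BLASTING, NON-ELECTRIC\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

consumption_data = read_lines(tmp.name)
os.remove(tmp.name)

print("Number of elements in the list is: " + str(len(consumption_data)))
for element in consumption_data:
    print(element)
```

Here the list's own length replaces the separate “listSize” counter, and stripping the trailing newline avoids the blank line printed after each element in Figure 20.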

c. Analyze, Verify, and Correct Inputs (2.1-2.3) 

At this point, the consumption data elements have been accessed by the program 
and can be further analyzed and manipulated without the need of the input file. Two 
different approaches may be used to handle this phase: 

• Walkthrough Analysis. Using this approach, the end user is heavily 
involved in the decision-making process on a line-by-line, 
paragraph-by-paragraph, or page-by-page basis. Having been read into the 
program-provided data structure, a line, paragraph, or page of potential 
consumption data is presented to the user. The user must then decide, 
“yes” or “no,” whether the item(s) presented is/are valid data 
element(s). If the user responds “yes,” the item(s) are transcribed into a 
secondary data structure, one that consists of all the valid data 
elements. If the user responds “no,” the element(s) are disregarded and the 
user is presented a new data set. Alternatively, the user could be presented 
an opportunity to access a given data element and manually enter the 
correct information, thereby ensuring the consumption data is consistent 
with the original source. The goal of this approach is to process the entire 
document without the need to go back through it repeatedly. Although this 
approach is time consuming, it aims to ensure all the data elements in the 
document are analyzed. Should the logic to automatically analyze the 
document not exist, this would be a feasible alternative. 

• Automated Analysis. Under this approach, the end user is involved in the 
decision-making process only at the end, after the program has made a “best 
effort” to autonomously extract all consumption data elements. This 
method is preferred over the walkthrough analysis only if the parsing logic 
is extensive and accurate. The end goal of using this method is to save 
time by allowing the program to quickly analyze the document and present 
a summary at the end. Once the summary has been populated, the end user 
must then review the output for correctness. Should the parsing logic be 
subpar, the end user may find it takes longer to correct the summary than 
it would to conduct the walkthrough analysis. 

Determining the best method to use, walkthrough or automated, depends on a 
wide variety of factors, both personnel- and software-related. Answering these questions 
may help to address this conundrum: 

• Is the OCR output file an accurate depiction of the original consumption 
document? 

• Is the end user adequately trained to verify the analyzed data elements? 

• Is the program mature enough to handle automated analysis? 

• Is the parsing logic robust enough to recognize and extract all of the 
potential data elements? 

Although both approaches are discussed, the automated program was specifically 
designed to handle consumption documents reviewed by this thesis. In order to create a 
more robust automated application, further document analysis and logic test creation 
must occur. However, the walkthrough analysis program could be used to analyze any 
input file since it treats a line of data unambiguously. Figure 21 shows the input values to 
the walkthrough analysis program. Figure 22 shows the program and its output. 


This is not force consumption data. 
B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT 
These are random numbers: 1234567890. 
B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC 

Figure 21. Walkthrough Analysis Inputs: Consumptiondata.txt 

# This is a program to conduct walkthrough analysis.

file = open("consumptiondata.txt", 'r')   # Input file shown in Figure 21
initialConsumptionData = []    # Initial contents of the input file
finalConsumptionData = []      # Validated outputs
listSize = 0                   # Counters for program termination
finalListSize = 0
i = 0

# First, read in all the inputs into the initial list data structure.
for line in file:
    initialConsumptionData.append(line)
    listSize = listSize + 1
file.close()                   # Close the input file, no longer needed

# Afterwards, allow the end user to validate data elements.
userResponse = ""              # A variable to store the user's response
while (i < listSize):          # Terminate after the last element
    print("Is this a force consumption data element?")
    print(initialConsumptionData[i])
    userResponse = input()
    if (userResponse == "yes"):    # Only keep valid inputs
        finalConsumptionData.append(initialConsumptionData[i])
        finalListSize = finalListSize + 1
    i = i + 1

i = 0
while (i < finalListSize):
    print(finalConsumptionData[i])   # Print out the valid outputs
    i = i + 1

--- OUTPUT ---

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT

B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

Figure 22. Program to Execute Walkthrough Analysis 

First, the program reads the post-OCR inputs into a list data structure named 
“initialConsumptionData.” Once this is complete, the input file is closed. Afterwards, a 
loop control structure is used to cycle through all the data elements in the list. As each 
data element is presented, the end user must make a “yes” or “no” decision that the 
element is a consumption data element. If the user enters “yes,” this data element is 
copied into the final data structure named “finalConsumptionData” and the process 
continues until no more elements exist. Elements that do not receive a “yes” decision are 
skipped and the process continues until no more elements remain to be considered. At the 
end, the final list of data elements, having been verified by the end user, is printed out. 

In addition to the ability to individually validate each element, logic can be 
included to allow the user to correct each element as necessary, as noted above for the 
walkthrough analysis. For example, the question “is this element accurate?” can be 
posed to the user, allowing the user to inspect each data element for accuracy. Should 
the element be inaccurate, the program would then allow the user to modify the value of 
the element prior to placing it in the output data structure. Although this presents a 
line-by-line review of the data elements, many different variations can be created. For 
example, instead of showing one element at a time to the end user, a page of data can be 
presented with each line having an associated index number. The user could then indicate 
which lines are consumption data elements and they would be copied over in the same 
manner. 
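
Both variations can be sketched in a few lines. The fragment below is an illustration under stated assumptions rather than the program used in this thesis: the function names are hypothetical, the fourth input line carries a simulated OCR error, and scripted replies stand in for keyboard input so that the fragment runs unattended.

```python
def walkthrough(lines, ask=input):
    """Ask the user about each line; return the accepted (possibly corrected) lines.

    ask is any function that takes a prompt and returns the user's reply.
    """
    accepted = []
    for line in lines:
        if ask("Is this a force consumption data element? " + line + " ") != "yes":
            continue                                   # skip rejected lines
        if ask("Is this element accurate? ") != "yes":
            line = ask("Enter the corrected element: ")
        accepted.append(line)
    return accepted


def select_by_index(page, chosen):
    """Page-at-a-time variant: keep the lines whose 1-based index was chosen."""
    return [line for number, line in enumerate(page, start=1) if number in chosen]


page = [
    "This is not force consumption data.",
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT",
    "These are random numbers: 1234567890.",
    "80471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",   # simulated OCR error
]

# Scripted replies stand in for keyboard input.
replies = iter([
    "no",
    "yes", "yes",
    "no",
    "yes", "no", "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",
])
validated = walkthrough(page, ask=lambda prompt: next(replies))
print(validated)
print(select_by_index(page, {2, 4}))
```

Passing the prompt function as a parameter keeps the decision logic separate from the keyboard, which also makes the routine easy to exercise with recorded answers.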

The second approach, an automated analysis program, utilizes the same 
consumption data example inputs shown in Figure 21. While the goal is to achieve 
automation, the program still requires the end user to review the summary results created 
by the program prior to saving. Figure 23 illustrates how pre-defined logic statements can 
be used to facilitate automation for most of the program. 

# This is a program to conduct automated analysis.

file = open("consumptiondata.txt", 'r')   # Input file shown in Figure 21
initialConsumptionData = []    # Initial contents of the input file
finalConsumptionData = []      # Validated outputs
listSize = 0                   # Counters for program termination
finalListSize = 0
i = 0

# First, read in all the inputs into the initial list data structure.
for line in file:
    initialConsumptionData.append(line)
    listSize = listSize + 1
file.close()                   # Close the input file, no longer needed

# Afterwards, allow the program to make a decision based on logic statements.
searchstring = "SQUAD DEMOLITION SET"    # Pre-defined force consumption element
for element in initialConsumptionData:   # Inspect each element in the list
    if searchstring in element:
        finalConsumptionData.append(element)   # Found a match, add to output
        finalListSize = finalListSize + 1

while (i < finalListSize):
    print(finalConsumptionData[i])       # Print the outputs
    i = i + 1

--- OUTPUT ---

B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT

B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC

Figure 23. Program to Execute Automated Analysis 


Similar to the walkthrough analysis program, this program begins by reading and 
storing inputs into a list data structure. One approach for designing logic statements is to 
create search strings. In this example, the string “SQUAD DEMOLITION SET” is used 
to represent a consumption data element from the Class V(W) planning factors table. 
After this has been defined, the program scans through each element of input and does a 
logical comparison between the input and the desired search string. If the program is 
presented with a line that contains the pre-defined search string, it places that particular 
line into the output listing. This test can create false positives and must be further 
refined. This simple example merely illustrates how the program can conduct its own 
logical comparison of inputs without the need for constant human interaction. It also 
helps to illustrate how one could approach logic creation. For example, one could create a 
consumption data dictionary containing an extensive number of terms that can be 
searched and compared against. Once this dictionary is populated, the program could then 
test whether any of those terms exist in a particular line of consumption information. 
Words such as “table” and “factors” would be good candidates for the dictionary along 
with specific consumption data table names. 
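
The dictionary idea can be sketched as follows. The term list is hypothetical, and the identifier pattern (one letter followed by three digits) is an assumption based only on the examples M032 and M130 in this chapter; requiring both a dictionary term and such an identifier is one possible way to reduce false positives, not a validated rule.

```python
import re

# Hypothetical term list; a fielded dictionary would be built from real documents.
CONSUMPTION_TERMS = {"SQUAD DEMOLITION SET", "TABLE", "FACTORS"}

# Assumed identifier shape: one capital letter followed by exactly three digits.
DODIC_PATTERN = re.compile(r"\b[A-Z]\d{3}\b")


def matched_terms(line):
    """Return the dictionary terms found in line, ignoring case."""
    upper = line.upper()
    return sorted(term for term in CONSUMPTION_TERMS if term in upper)


def is_candidate(line):
    """Require both a dictionary term and a DODIC-like token."""
    return bool(matched_terms(line)) and bool(DODIC_PATTERN.search(line))


lines = [
    "This is not force consumption data.",
    "B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT",
    "The factors in this table are explained below.",
    "B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC",
]
candidates = [line for line in lines if is_candidate(line)]
for line in candidates:
    print(line)
```

The third input line shows why a term match alone is not enough: it contains both “factors” and “table” yet carries no consumption data, so the second test filters it out.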

It is important to note that neither the walkthrough nor the automated analysis 
program represents a panacea. Due to the infancy of such a program, it may be 
necessary to begin with more of a hands-on application, such as the walkthrough analysis 
program, which evolves over time into more of an autonomous solution. As the 
pre-defined logic base of the automated program becomes more mature, the program will 
produce more reliable results. Refinement and standardization of the input documents 
will also help to increase reliability. 

d. Export Analyzed Data (3.1 and 3.2) 

Once the end user has reached this part of the program, the refined consumption 
outputs should be available. These outputs should be 
carefully reviewed and corrected to be consistent with the original source document. 
Although we use the term “end user” to refer to the user of the application, we 
fully expect that this data has also gone through multiple levels of verification 
(i.e., “up the chain of command”). Once this information has been vetted, it can then be 
directed as input to a database or saved to a file. 

(1) Export to Database 

Under this approach, the program can leverage pre-defined software 
packages that interact with database systems. For example, Python has a module 
named “_mysql” that allows it to interface with Oracle’s relational database software, 
MySQL. 
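
The export step can be sketched without a running database server by substituting sqlite3, a module in the Python standard library, for the “_mysql” package named above. This substitution, the table name “consumption,” and its two-column layout are assumptions made so the fragment is self-contained; a MySQL-based export would follow the same pattern with a different connection object.

```python
import sqlite3


def export_to_database(conn, pairs):
    """Store vetted (key, value) pairs; re-exporting a key overwrites its value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS consumption (weapon_id TEXT PRIMARY KEY, data TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO consumption (weapon_id, data) VALUES (?, ?)",
        pairs,
    )
    conn.commit()


vetted = [
    ("B0471", "SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT"),
]
conn = sqlite3.connect(":memory:")      # a file path would make the data persistent
export_to_database(conn, vetted)
stored = conn.execute(
    "SELECT data FROM consumption WHERE weapon_id = ?", ("B0471",)
).fetchone()
print(stored[0])
conn.close()
```

Parameter placeholders (“?”) are used rather than string concatenation so that punctuation in OCR output cannot alter the SQL statement.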

(2) Export to File 

Instead of sending the outputs to a database, they can be redirected to a file. In 
order to do this, the program simply opens a file in the write (“w”) mode, as previously 
mentioned. By saving to a file, many different options exist based on the output value 
format (e.g., .docx, .txt, .xml). 
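
A minimal sketch of the file export follows, assuming the vetted output is a list of (key, value) pairs. The function names, file names, and the tab-separated layout are illustrative choices rather than the format used by the thesis programs; JSON is included as a second output format from the Python standard library.

```python
import json
import os
import tempfile


def export_to_text(path, pairs):
    """Write one 'key<TAB>value' line per pair, replacing any existing file."""
    with open(path, "w") as handle:          # "w" truncates or creates the file
        for key, value in pairs:
            handle.write(key + "\t" + value + "\n")


def export_to_json(path, pairs):
    """Write the same pairs as a JSON object keyed by identifier."""
    with open(path, "w") as handle:
        json.dump(dict(pairs), handle, indent=2)


pairs = [("B0471", "SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT")]

with tempfile.TemporaryDirectory() as folder:
    text_path = os.path.join(folder, "consumption.txt")
    json_path = os.path.join(folder, "consumption.json")
    export_to_text(text_path, pairs)
    export_to_json(json_path, pairs)
    with open(text_path, "r") as handle:
        text_contents = handle.read()
    with open(json_path, "r") as handle:
        json_contents = json.load(handle)

print(text_contents, end="")
print(json_contents)
```

Because the “w” mode overwrites an existing file without warning, a fielded program would likely confirm the file name with the user before exporting.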

With the methods for processing a file established, the next chapter 
discusses options for OCR applications by which the files themselves may be generated 
from existing hardcopy or non-character-based files. 


IV. FIELD DEMONSTRATION, TESTING, AND RESULTS 


A. OCR TESTING AND COMPARISON 

1. Testing Environment 

Pages were selected from various consumption documents that illustrated the 
different types of data and layouts that are commonly encountered by the planner: 

• Figure 24 illustrates the front page of a Marine Corps order. This page 
contains information that provides background information and usage 
procedures for the document. However, it also contains information that 
can be used to uniquely identify the document and can be extracted for use 
as the key in a (key, value) pair. This example will be referred to as EX-1. 

• Figure 25 illustrates a page that contains a usage table. This table is 
presented in profile view but must be turned 90 degrees clockwise to a 
landscape view to read it. This example will be referred to as EX-2. 

• Figure 26 illustrates a page that contains a usage table in portrait view. 
This table appears as an image in the input document. This example will 
be referred to as EX-3. 

• Figure 27 illustrates a page that contains a usage table in portrait view. 
This table has been previously-created using table formatting from another 
text editor. This example will be referred to as EX-4. 

Each application was graded on a scale of 1-3: one for poor, two for fair, and 
three for good. A summary of these scores is presented at the end of this 
section. The following factors were compared and graded for each application: 

• Accuracy Rate. The accuracy rate is depicted as the number of errors per 
page. When a data element is not converted or converted incorrectly, we 
counted it as a single error. A data element is defined as one single entry 
or word. For example, “B0471 SQUAD DEMOLITION SET” illustrates 
four data elements: B0471, SQUAD, DEMOLITION, and SET. Grading 
of this criterion was based on three thresholds: 

• 0-50 errors per page. Awarded a three. 

• 50-100 errors per page. Awarded a two. 

• >100 errors per page. Awarded a one. 

• Consistency. Consistency was tested to see whether the applications return 
regular results and whether they encounter the same errors when 
given the same input document. To test consistency, each page was 
scanned with each OCR application 10 times. Grading of this criterion was 
based on the following: 

• 8-10 OCR attempts produced the same/similar result. Awarded a 
three. 

• 4-7 OCR attempts produced the same/similar result. Awarded a 
two. 

• <4 OCR attempts produced the same/similar result. Awarded a 
one. 

• Speed. Speed was measured as the number of pages that could be 
converted per hour. Speed also takes into account the time necessary for a 
user to make corrections. Thus, speed is also directly related to the 
accuracy rate. Grading of this criterion was based on the following: 

• 45-60 pages processed per hour. Awarded a three. 

• 30-45 pages processed per hour. Awarded a two. 

• <30 pages processed per hour. Awarded a one. 

• Ease of Use. This relates to the user’s ability to efficiently use the 
application. This factor is based on our use of the program and may not 
accurately represent a novice user. Time to master functionality of the 
program guided the following grading criteria: 

• <5 minutes required to master the program. Awarded a three. 

• <15 minutes required to master the program. Awarded a two. 

• >15 minutes required to master the program. Awarded a one. 

• Functionality. This relates to the application’s ability to provide useful 
and helpful tools to the user (e.g., spellcheckers, input formats, and output 
formats). Grading of this criterion was based on the following: 

• The application had numerous input/output file types and included 
functionality that substantially increased end-user productivity. 
Awarded a three. 

• The application had several input/output file types and included 
functionality that increased end-user productivity. Awarded a two. 

• The application had limited input/output file types and included 
functionality that had a limited impact on end-user productivity. 
Awarded a one. 
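
The numeric criteria above can be expressed as a short Python sketch. The function names are hypothetical, and two choices are assumptions: boundary values shared by two ranges (for example, 50 errors or 45 pages) are assigned to the better grade, and errors are counted by aligning whitespace-separated data elements against a known-correct reference. The scores reported in this chapter were not necessarily produced this way.

```python
from difflib import SequenceMatcher


def count_errors(reference_text, ocr_text):
    """Count reference data elements that are missing or altered in the OCR output."""
    reference = reference_text.split()
    converted = ocr_text.split()
    matcher = SequenceMatcher(None, reference, converted, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(reference) - matched


def grade_accuracy(errors_per_page):
    """Three for 0-50 errors per page, two for up to 100, one otherwise."""
    if errors_per_page <= 50:
        return 3
    if errors_per_page <= 100:
        return 2
    return 1


def grade_consistency(matching_attempts):
    """Grade the number of attempts, out of 10, that gave the same or a similar result."""
    if matching_attempts >= 8:
        return 3
    if matching_attempts >= 4:
        return 2
    return 1


def grade_speed(pages_per_hour):
    """Three for 45 or more pages per hour, two for 30 or more, one otherwise."""
    if pages_per_hour >= 45:
        return 3
    if pages_per_hour >= 30:
        return 2
    return 1


# The four-element example from the accuracy criterion, with two elements garbled.
errors = count_errors("B0471 SQUAD DEMOLITION SET", "B0471 SQUAD DEMOUTION 5ET")
print(errors, grade_accuracy(errors))
```

Ease of use and functionality are judged qualitatively and are therefore left out of the sketch.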

Figure 24. EX-1, Marine Corps Order Front Page (from [2]) 

Figure 25. EX-2, Usage Table in Profile View (from [2]) 

Figure 26. EX-3, Usage Table as an Image (from [2]) 

Figure 27. EX-4, Usage Table as a Table (from [2]) 

2. OCR Applications 


The following off-the-shelf programs were compared: 

• http://www.onlineocr.net/. Free, open-source, and online OCR 
application. The primary focus of this application is to conduct OCR. 
Available in guest and member modes: 

• Guest Mode. Accepts PDF and image (JPG, BMP, TIFF, and GIF) 
with a maximum file size of 5 MB. Output file types are MS 
Word® .docx, MS Excel® .xlsx and plain text .txt. Conversion is 
limited to 15 documents (single page) per hour. 

• Member Mode. Accepts the same inputs as guest mode. Output 
file types are PDF, MS Excel® .xls and .xlsx, MS Word® .doc and 
.docx, plain text .txt, and an RTF document .rtf. Maximum input file 
size is increased from 5 MB to 100 MB. New members have a 25- 
page credit. Once this limit has been reached, additional pages 
must be purchased. The price-per-page decreases when bulk 
amounts are purchased. For example, purchasing 50 pages has a 
cost of 10 cents per page ($4.99) whereas purchasing 50,000 pages 
has a cost of 0.4 cent per page ($199.95). 

• Microsoft OneNote®. Commercial software published by the Microsoft 
Corporation. Sold stand-alone or as a part of the Microsoft Office Suite®. 
The primary focus of this application is to provide a workspace for the 
user to collect notes and organize documents. OCR is a feature within 
OneNote®. Accepts any input file on the Windows OS: images, PDF, MS 
Office® document extensions, text files, etc. Output file extensions are 
.doc, .docx, .txt, and .pdf. 

• Nuance OmniPage®. Commercial software published by the Nuance 
Corporation. The primary focus of this application is to conduct OCR. 
Accepts digital camera images, images (JPG, BMP, TIFF, GIF, PNG), and 
PDF. There are over 50 different output file extensions that fall into eight 
categories: HTML, MS Excel®, MS Word®, MS PowerPoint®, PDF, 
RTF, Unicode Text, and XML. 

Table 1 is a summary of the input file types accepted by the applications. 

Likewise, Table 2 illustrates the output file types that they are capable of producing. 

                                    onlineocr.net       Microsoft   Nuance 
Input file type                     Guest     Member    OneNote®    OmniPage® 
Images (.jpg, .gif, .tiff, .png)    Yes       Yes       Yes         Yes 
PDF (.pdf)                          Yes       Yes       Yes         Yes 
Text (.txt)                         No        No        Yes         No 
MS Word® (.doc, .docx)              No        No        Yes         No 
MS Excel® (.xls, .xlsx)             No        No        Yes         No 
Digital Camera (direct input)       No        No        No          Yes 
Totals                              2         2         5           3 

Table 1. OCR Input File Types 

                                    onlineocr.net       Microsoft   Nuance 
Output file type                    Guest     Member    OneNote®    OmniPage® 
PDF (.pdf)                          No        Yes       Yes         Yes 
Text (.txt)                         Yes       Yes       Yes         Yes 
MS Word® (.doc, .docx)              Yes       Yes       Yes         Yes 
MS Excel® (.xls, .xlsx)             Yes       Yes       Yes         Yes 
MS PowerPoint® (.ppt)               No        No        No          Yes 
Rich Text Format (.rtf)             No        Yes       No          Yes 
HTML                                No        No        No          Yes 
XML                                 No        No        No          Yes 
Unicode Text                        No        No        No          Yes 
Totals                              3         5         4           9 

Table 2. OCR Output File Types 

Since the primary focus of OCR is detecting and transcribing text from images, it 
is unremarkable that each of the applications accepts image files as input. However, the 
fact that each accepts PDF files is important, since this is the format in which the 
consumption documents will most likely be available. It is also important to note that 
OneNote® is capable of accepting many additional input file types because it conducts a 
file-to-image conversion of all input documents. For example, when a Word® document 
is placed into OneNote®, it is represented as an image in the note. OCR must 
then be conducted on the image to extract the text. 

Although all of the applications are capable of creating Word® and Excel® 
outputs, the text file extension, .txt, provides the most flexibility for developing the 
conversion programs. Creating a program to accept Word® and Excel® files adds no 
extra functionality and often requires unnecessary libraries. Therefore, this thesis uses 
a text file with a .txt extension as the output format of the OCR applications for 
later use in the data extraction programs. 
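
To illustrate why the .txt format needs no extra libraries, the following Python sketch reads a text file with built-in file handling only and splits each line into data elements. The function name and the sample file contents are assumptions for illustration and do not reproduce the data extraction programs of this thesis.

```python
import os
import tempfile


def read_data_elements(path):
    """Return one list of whitespace-separated data elements per non-blank line."""
    rows = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            elements = line.split()
            if elements:  # skip blank lines left between entries
                rows.append(elements)
    return rows


# Write a small stand-in for an OCR output file (hypothetical contents).
sample_path = os.path.join(tempfile.mkdtemp(), "ocr_output.txt")
with open(sample_path, "w", encoding="utf-8") as handle:
    handle.write("B0471 SQUAD DEMOLITION SET\n\nM131 CAP, BLAST NON-ELEC 480\n")

rows = read_data_elements(sample_path)
print(rows)
```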

3. Online, Open-Source OCR 

The first application we tested was the open-source application available at 
http://www.onlineocr.net/. In order to maximize the number of documents that could be 
tested, the guest mode was utilized. The member mode offered no further extension of 
capability that would have been beneficial to the discussion. Figure 28 illustrates the 
main page of the application as of 8 August 2014. 

(Callouts in the figure: one file per upload with a maximum size of 5 MB; support for 
46 languages; output file types .docx, .txt, and .xlsx.) 

Figure 28. http://www.onlineocr.net: Main Page and Features (after [9]) 


Although the application claims that it can convert a file up to 5 MB, it is 
important to note that it will only convert one page. If the input is a multi-page PDF, it 
will only convert the first page. Therefore, each page was submitted separately until all 
of the pages had been converted. The application converted EX-1 with zero errors. This 
result is unremarkable because the page was previously prepared using text editing 
software and presents a very clear input document. Figure 29 illustrates the post-OCR 
results of EX-1. 

Figure 29. Post-OCR Results of EX-1 using onlineocr.net (after [2]) 

EX-2 was tested next. Figure 30 illustrates the post-OCR results with errors 
highlighted in red. 

Figure 30. Post-OCR Results of EX-2 using onlineocr.net (after [2]) 


When OCR was conducted on EX-2, 28 errors were encountered: 

• Five errors occurred in the conversion of text. 

• 23 errors occurred in the conversion of numbers. 

• Of the 23 errors that involved numbers, 16 of those errors occurred in the 
first column, “Weapon ID.” 

The high error rate in the “Weapon ID” column can be attributed to the 
application’s inability to distinguish the letter “B” from the numbers “6” and “8.” This 
was most likely caused by the fact that the letter “B” has rounded corners and appears 
fuzzy in the input document. This is a normal and expected degradation of a document 
whose original publication date was 15 April 1997. Also, the enclosures may have been 
adopted from a document that was created before the source document came into 
existence. An important finding is that the online OCR application was intelligent 
enough to determine that the data, although presented in a vertical fashion, was best 
suited for representation as a table in a horizontal view. Thus, the application took the 
page that was originally presented in a profile view and presented its output as a file in 
landscape view. 
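
Confusions of this kind can be flagged for user review with a simple format check. The Python sketch below assumes that a weapon ID consists of one uppercase letter followed by four digits, a pattern inferred only from examples such as “B0471”; the function name and the suggested correction are hypothetical, and any suggestion would still require confirmation by the user.

```python
import re

# Assumed format: one uppercase letter followed by four digits (e.g., "B0471").
WEAPON_ID = re.compile(r"[A-Z][0-9]{4}")
# A leading "6" or "8" followed by four digits may be a misread letter "B".
MISREAD_B = re.compile(r"[68][0-9]{4}")


def review_weapon_id(token):
    """Return (is_valid, suggestion); suggestion is None when no fix is proposed."""
    if WEAPON_ID.fullmatch(token):
        return True, None
    if MISREAD_B.fullmatch(token):
        return False, "B" + token[1:]
    return False, None


for token in ["B0471", "80471", "SQUAD"]:
    print(token, review_weapon_id(token))
```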

EX-3 was tested next. Figure 31 illustrates the post-OCR results with errors 
highlighted in red. 


Figure 31. Post-OCR Results of EX-3 using onlineocr.net (after [2]) 


When OCR was conducted on EX-3, 37 errors were encountered: 

• 33 errors occurred because the application omitted three entire lines of data. 

• Two errors occurred in the conversion of text. 

• Two errors occurred in the conversion of numbers. 

An important finding is that the OCR application did not place the data elements 
into a table. The application interpreted the page contents as an image, rather than a table. 
However, an appropriate amount of space was placed between the data elements for 
readability. The cause for the omission of three lines of data is unknown. 

EX-4 was tested next. Figure 32 illustrates post-OCR results with errors in red. 


Figure 32. Post-OCR Results of EX-4 using onlineocr.net (after [2]) 

When OCR was conducted on EX-4, 12 errors were encountered: 

• Five errors occurred in the conversion of text. 

• Seven errors occurred in the conversion of special characters. 

The application was unable to determine the difference between the letter “D” and 
the number “0” due to degradation of the source document. The application was unable to 
determine the difference between the forward slash character “/” and the letter “1.” An 
important finding is the fact that the application recreated the table structure from EX-4 
near-perfectly. 
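
One possible way to resolve such ambiguities during review is to compare each code against a list of known valid codes. The Python sketch below is an illustration under stated assumptions: the confusion table covers only the “D” and “0” pair described above, the list of known codes is a hypothetical sample, and the function names are not part of the programs described in this thesis.

```python
from itertools import product

# Assumed confusion table: each character maps to the characters it may stand for.
CONFUSABLE = {"0": "0D", "D": "D0"}


def candidate_codes(token):
    """Return every string obtained by swapping confusable characters in token."""
    options = [CONFUSABLE.get(character, character) for character in token]
    return {"".join(choice) for choice in product(*options)}


def resolve_code(token, known_codes):
    """Return the single known code that token could represent, or None if unclear."""
    matches = sorted(candidate_codes(token) & set(known_codes))
    return matches[0] if len(matches) == 1 else None


# Hypothetical list of valid codes, for illustration only.
known_codes = {"D533", "D540", "D541", "N523"}
print(resolve_code("0533", known_codes))
```

Returning None when zero or several known codes match leaves the decision with the user instead of guessing.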

Table 3 illustrates a summary of the accuracy rate and important findings for the 
open-source application. 


Page    Total    Word     Number   Cause / Findings 
        Errors   Errors   Errors 
EX-1    0        0        0        • 100% accuracy rate of conversion may be 
                                     attributed to the previous use of text 
                                     editing software. 
EX-2    28       5        23       • Problems distinguishing the letter “B” 
                                     from the numbers “6” and “8.” 
                                   • Text converted from portrait view to 
                                     landscape. 
                                   • Data placed in a table data structure. 
EX-3    37       5        32       • Text interpreted as an image rather than a 
                                     table. 
EX-4    12       8        4        • Near-perfect table recreation. 
                                   • Problems distinguishing the letter “D” 
                                     from the number “0.” 
                                   • Problems distinguishing the special 
                                     character, forward slash “/,” from the 
                                     letter “1.” 
Totals  77       18       59 

Table 3. Online, Open-Source OCR Results and Findings 

Overall, the online, open-source OCR application demonstrated the ability to 
convert the original consumption documents into useful output files. It also had the 
ability to recreate tables and recognize when data is best presented in other formats. For 
example, converting the data contained in EX-2 from a vertical profile view to a 
horizontal landscape view is helpful. Based on the findings for this application, the 
following scores were given: 

• Accuracy: 2 

• Consistency: 3 

• Speed: 1 

• Ease of Use: 3 

• Functionality: 1 

The accuracy rate of the application was manageable and consistent. The 
application suffers in speed because guest mode limits the user to 15 pages per hour; 
member mode would mitigate this limit, but at an added cost. Using and 
understanding the functionality of the program can be accomplished in five minutes. The 
application provides the fewest input and output formats and has no 
spellchecking ability. Once OCR has been completed, the user must open the output file 
in a text editor to review and correct its contents. 

4. Microsoft OneNote® OCR 

Microsoft OneNote® was tested next. Figure 33 illustrates the post-OCR results 
of EX-1. 


DE PARTMENT O F THE IM AVY 
HEADQUARTERS UNITED STATES MARINE CORPS 
WAS H IN GTON, DC 20380-0001 

Figure 33. Post-OCR Results of EX-1 using OneNote® (after [2]) 

Although OneNote® was given the same page that the online, open-source 
application was given, it was unable to OCR the document past the first three lines. This 
is remarkable because it represents the first major failure of an OCR application to 
successfully convert an input document. To explore the significance of different file 
formats, the input document was converted from PDF to JPG. Figure 34 illustrates the 
second round of OCR testing conducted on EX-1 in an image format. 


DEPARTMENT OP THE NAVY 

MtAJUU*J{l ills UNIICO51*155 5*4155 CUera 

W*SMINCICS, OC 2VSIU4ULt 

MIO 6010IE 

C092 

15 Apr97 

NASILIE CORPS 0€Dtl 5010.15 
From Coneendantofthe Mernnetorpe 
To: Distribuftion List 

Sub] CLASPV (W) PtAlSLL NG FACTORS FOR FttEtMPR \EX FORCE CCFSIAT 
OPERA? IfSlS 

Rat: la) Marina Carp. Uraund Mnunitlen WarMaterialj 
ReguiranentlWARI Detemination {1SS5-1SS6) Study 
Final Report IWTAt) 
lb)MCO DWOSSe (NOtal 

l r) SWIM 4-1 
Id)574 9-6 

l s) SW 9-13 

anti: li) Nxplanatinn at iba Sranarib-5.aaa Cat Planning 
Factors Tables 

2) I nf a ntry-H eavy Th re at Co n b at PI a n ning P a eta re tab I e 
II) Arnor-HeavyThraat CneatPlanningFactorsTabla 
li) Canpoatte Ccnbst Planning Factors taSte 
IS) Conbst Planning Factors {orSpecial Operat ions 
161 ArtilleryAnclllaryttens 

Figure 34. Post-OCR Results of EX-1 using OneNote®: Second Pass (after [2]) 


Without listing the entire contents of the output, we can quickly see that the 
output is highly inaccurate in both spelling and format. Thus, in order to make an 
accurate output file, the post-OCR results would need to be heavily corrected. OneNote® 
has a built-in spellchecker. This can be leveraged to correct the errors 
and alleviate some of the burden on the user; however, the user must select this option 
since it does not turn on automatically after OCR is complete. 

EX-2 was tested next. Figure 35 illustrates the post-OCR results. 

Figure 35. Post-OCR Results of EX-2 using OneNote® (after [2]) 

When OCR was conducted on EX-2, OneNote® was unable to accurately process 
the input. Although the document was converted to an image and placed back in 
OneNote® in a landscape view as a secondary test, the same result was encountered. 
Thus, heavy modification or reformatting of the input document would be necessary to 
properly process the document. 

EX-3 was tested next. Figure 36 illustrates the post-OCR results. 

MCOSOIO.IE 

Z CO MB AT P LAN N BN G F ACTO RS FO R .SP ECIA LO P E RATIO NS 
QUANTTfY PER 

DO Di e NOMENCLATURE MEU t SOC) 

I CT^G, 12 GA(MJ BUCK 460 
CTG. 12GA7.5 SHOT 400 
I CTG, 12 G ASLUG 250 

jlZGA^^^^B™ 

|9MM BALI 1^000 
|CTG. .4SCAL BALLBiOOO 
AX14SHOTGUN PRIMER l,OOOi 
DWB5 DIVERS IONARY CHARGE 14K 141 NIOD 0500' 

WHITE FLARE 550< 

GREEN FLARE 550 
LS2S flare 5^ 

\X11 SIGNAL ROCKET LAUNCHER KfTSO 

I DEMO, BLOCK. TNT 1/2 L£ 192 
DEMO, BLOCK. TNT 1 LB 96 
DEMO, CRATE RUNGD 
ICAP. BLAST ELEC ^ 

CAP, BLAST NON- ELEC 400! 

CORD, DETONATING, RE IN 2,000 PT _ 

FUZE, BASTING,TINE,EXPLOSIVE _ 

IGNITER, TIME B ASTINGP^ 300 
C R6, DEMO, XXP LOSIVE .SHE ET 30 FT P E R RO LL 2 RO 
CRO, DEMO, EXPLOSIVESKEET25 FT PER ROLL 2 RO 
C HG, DE MO, EX P LOSIVE SHE ET 19 FT P ER RO LL 2 RO 
I FIRIN G DEVICE, DEMO, MULTIPURPOSE M124 112 
, DEMO 20CRAMS200 
.DEMO, SHAPED, FLEX 
I DEMO,SHAPED, FLEX 
I DEMO,SHAPED. FLEX 
DEMO,-SHAPED, FLEX 
. DEMO, SHAPED. FLEX 
,DEMO, SHAPED, FLEX 
CHG, DEMO,SHAPED, FLEX 
DEMO, SHAPED, FLEX 
DUAL LEAD NONELP RIMADET, 175 
I FI RIN G DEVICE HAN D ME LD 14K 54 S 
ENCLOSURE (5) 

2 


1000 FT 



6 PT LENGTH 12 
. 6 FT LENGTH 32 
6 FT ^^>12 

t 6 PT^B 20 

6 FT LENGTH 12 
6 PT LENGTH 12 
6 FT^^|l2 
6 FT LENGTH 12 
DELAY, 100 FT LENGTH 75 


Figure 36. Post-OCR Results of EX-3 using OneNote® (after [2]) 


When OCR was conducted on EX-3, 47 errors were encountered: 

• 32 errors occurred in the conversion of numbers. 

• 14 errors occurred in the conversion of text. 

• One error occurred in the conversion of special characters. 

The majority of the errors occurred in the first column, where the DODIC is 
represented by alphanumeric characters. The second region that encountered the most 
problems also involved alphanumeric characters. An important finding is that OneNote® 
provided very little in the way of formatting. The original input had large areas of white 
space to provide readability, whereas OneNote® left-aligned the majority of the document 
and removed this white space, making the output difficult to read for an end-user. 
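
The error counts above were tallied by hand. Purely as an illustration, the sketch below shows one way such a tally could be automated when a verified reference transcription is available. The function name classify_errors, the token-by-token comparison, and the sample strings are assumptions introduced here and are not part of the method used in this thesis.

```python
def classify_errors(reference, ocr_output):
    """Tally token-level OCR mismatches as number, text, or special errors."""
    counts = {"number": 0, "text": 0, "special": 0}
    # Assumption: both strings split into the same number of tokens.
    # zip() stops at the shorter sequence, so surplus tokens are ignored.
    for expected, actual in zip(reference.split(), ocr_output.split()):
        if expected == actual:
            continue
        # Ignore leading/trailing periods and commas when bucketing.
        core = expected.strip(".,")
        if any(ch.isdigit() for ch in expected):
            counts["number"] += 1
        elif core.isalpha():
            counts["text"] += 1
        else:
            counts["special"] += 1
    return counts


sample = classify_errors("CAP, BLAST ELEC 288 RB/WB",
                         "ICAP. BLAST ELEC ^ RB!WB")
print(sample)
```

Because the comparison is positional, an OCR pass that splits or merges words would shift every later token, so a sketch like this is only meaningful after the two texts have been aligned.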

EX-4 was tested next. Figure 37 illustrates the post-OCR results. 


ARTILLERY ANCILLARY tTEMS 

MCCSOIO.IE 

APR15 

OS02P ROJ155MM. ADAM-S. Wr731 

'A. NCILIARYITEM 

omedaure 

CHGPROP15544M, RBJWB, MIIBAI! 
cm PROP 155MM, GREEN BAG, M3 
CHG PROP 155MM. WHITB BAG. M4 
FUZE, ET. 14762 

(N2fi5, FZ MTSO, 14577 may'substitute) 

PRIMER. PERCUSSION. 1452 
C)IG PROP 1554^1, RBIWB MliqAl! 

C140 PR0P 155^. GREEN BAG. M3 
CHO PROP 15544M, WHflB BAG. IM 
FUZE. EF. 14762 

N2fi5, FZ MTSO. 14577 rn.yaJbstktlite) 

PRIMER. PERCUSSION, 1482 
CHG PROP 155MM. RBWS, MllBAl! 

CHG PROP 155MM, GREEN BAG, M3 
CHG PROP 155MM, WHIT6 BAG, M4 
FUZE.. ET, 14762 

(N2a5, FZ MTSO, 14577 may' sub^itute) 

(N24S. FZ MI, M565 m a/substitute) 

PRIMER. PERCUSSION, 1482 

0533 OIG PROP 155MM. RBW, Ml 19A1! 

0640CItOPROP15544M, GREEN BAG. M3 
0541 04G PROP 15544M, WHITE BAG. 144 
N523 PRIMER. PERCUSSION M62 
□533 CHG PROP 15544M, RB{W8,14113M! 

0640 04G PROP 15544M. GREEN BAG. 143 
0541 CHG PROP15644M, WHITE BAG, 144 
N289 FUZE. ET. 14762 

Figure 37. Post-OCR Results of EX-4 using OneNote® (after [2]) 


When OCR was conducted on EX-4, two errors were encountered that involved 
the conversion of text. Although the input had been previously prepared using a text 
editor and was presented in a highly structured table format, OneNote® failed to recreate 
the table and present the data in a useful format. Table 4 summarizes the 
accuracy rate and important findings for OneNote®. 

Page     Total     Word      Number    Cause / Findings 
         Errors    Errors    Errors 

EX-1     > 100     > 100     > 100     • Unable to convert the input document past 
                                         the first three lines of data. 
                                       • Native spellchecking capability 
                                         discovered. 

EX-2     > 100     > 100     > 100     • Complete failure to detect input layout. 

EX-3     47        14        32        • Words comprised of alphanumeric 
                                         characters represented 46 out of 47 errors. 
                                       • Problems distinguishing the special 
                                         character forward slash “/” from the 
                                         special character exclamation point “!”. 

EX-4     2         2         0         • Failure to recreate table for readability. 

Totals   > 100     > 100     > 100 



Table 4. OneNote® Results and Findings 


In summary, OneNote® was incapable of accurately conducting OCR on EX-1, 
EX-2, and EX-4. OneNote® comes with spellcheck functionality but does not 
provide a way to apply formatting to the OCR output. While the program 
offers unlimited OCR capability, it must be purchased in order to do so. The program's 
main strength remains its ability to take notes quickly and efficiently; its secondary 
OCR feature needs further refinement before wide-scale use as a 
reliable OCR application. Based on the findings for this application, the following scores 
were given: 

• Accuracy: 1 

• Consistency: 1 

• Speed: 1 

• Ease of Use: 2 

• Functionality: 2 

The accuracy rate of the application was poor and inconsistent. The application is 
capable of conducting OCR quickly; however, overall speed suffers based on the number 
of errors that need to be corrected by the user. The built-in spellchecker functionality can 
aid the user in correcting these errors and offsets some of the speed penalties. Using and 
understanding the functionality of the program can be accomplished in 10-15 minutes. 
The application provides the most input file types but has limited output options. 

5. Nuance OmniPage® OCR 

Nuance OmniPage® was the last application tested. When the software loads, the 
user is presented with a menu to choose what kind of conversion they would like to 
accomplish. Figure 38 illustrates this screen. 



Figure 38. OmniPage Start Screen 

While there are templates, known as “workflows,” the method used by this thesis 
was the “open file” option. Once this option is clicked, a dialog box is presented that 
allows the user to select which documents to OCR. An important finding 
is that the application allows the user to select multiple input documents of various 
formats at one time. For example, the user can select a PDF, an image, and another PDF 
all at the same time. Likewise, when a PDF is selected, all of its pages are imported. 
After the files have been selected and imported into the program, the user is presented 
with a workspace view. The workspace view has multiple frames and allows the user to 
rearrange the frames as desired. This view is presented in Figure 39. 



Figure 39. OmniPage® Workspace 

The three main frames of the application are a thumbnail screen (left), a page 
image screen (middle), and a text editor screen (right). Before OCR has been conducted, 
only the thumbnail screen and page image screen contain information. In order to start 
processing the document, the user must click a button aptly named “Start Processing.” 
Once the button is clicked, OCR is performed on the input documents, the text editor 
screen is populated to show the output, and a proofreading screen appears to walk through 
the document. Figure 40 represents the state of the application once the “Start Processing” 
button has been clicked. For clarity, Figure 41 illustrates the proofreading screen 
separately. 



Figure 40. OmniPage® Workspace post-OCR 

Figure 41. OmniPage® Proofreader 

The proofreader screen allows the user to walk through the document to verify two 
cases: spelling and suspected inaccuracy. If the program encounters a word that is not in 
its dictionary and therefore considers it a misspelling, “MCO” for example, it allows the 
user to change the word or add it to the dictionary. Adding the word to the dictionary 
eliminates the need to correct the word later in the document or in future documents. If the 
program encounters a word that it suspects is an inaccurate transcription, it prompts the 
user to enter the correct entry. Both of these cases are handled at the same time on a 
line-by-line basis. To aid the user, the application shows the original entry in the top box 
and allows the user to retype the information in the middle box. Once the OCR output has 
been reviewed and corrected, the user must click the “save to files” button. This opens a 
dialog box and prompts the user to enter a filename and desired output file type. Figures 42 
through 45 illustrate the OCR output produced by OmniPage®. 

DEPARTMENT OF THE NAVY 
H EADQJ ART E R S U H IT ED ST AT ES MAR I H E COR PS 
WASmMGTOM. DC 


MARIHE C'DRPS QRIJER 3 010. IE 


MOO a 010.IE 
0 332 
15 5pr 37 


From: CoimiHr.darjiT of the Mar±r.6 Corps 

To: DistributioTs List 

Stbj: CIASS PIANHIIC ES.CrOES FOR FLEET tIARIHE FORCE OOtIBAT 

OPERATICMS 


Ref: (a) Marire Corps Croimd Ajueutj it ion "War Materiel 

Reqtdrement C^IR^ Determination (1335-133'£) Stisd^ 
Final Report (NCTAl} 

(b) MOO P4400.33S (NOTAL} 

(cj FMFM 4-1 
(d> FM 3-e 
(ej FM 3-13 


Enel: (IJ Ejiplanation of the Seenario-Ease Combat Plantiing 

Factors Tables 

(2) Inf antr^^-Eeav;^" Threat Combat ?1 arming Factors Table 
(3J A.rmor-Eeav 7 Thmeat Combat Plarming Factors Table 
(4) Composite Combat Plarming Factors Table 
(5} Combat Plarmjing Factors for Special Operations 


(£} Artillery Ancillar;^ Items 


1 ■ Pur DO se . To promnlgate Class combat plarmjing factors (CPF'^sl to 

snpport Fleet Marine Force (FMF} combat operations. 

2 . Cancellation . MCO 3010.ID. 

3 . Eactground . Reference (aj reports the results of the Marine Corps Class 

Stndy^ (1335-133£) . Reference (b) establishes Marine Corps policy 
governing reqtii reraents dete rmin ation ^ aegnasition^ management^ and 
distr ibetion 

war reserve materiel. References (cl ^ (d) and (el provide logistical 

doctrine and associated tactics^ techmiiqces^ and procediires for 
Class KiW} support during combat operations. 

A . Plarming Factors. Factors to be nsed listing initial plarmaing for coii±jat 
operations are explained in enclosure (11 and shown in enclose res (21 through 
(51 - 


a . CP F" s reflect the ant icipated expenditure o f ground ammun it ion ove r 
designated time periods of combat operations. These rates represent the 
unconstrained requirement. "Ohconsrtrained requirements are based on approved 
force structure^ weapon mix^ anticipated duration of comb at and the 
anticipated intensity of conflict. Once Version 2.1 of the Ajnaunition 
P repo sit ionimg and Plarmjing System (APPSl is fielded^ the APPS 

D ISTRIEUTIOH STATEMENT A: Approved for public release; distribution is 
unlimited. 


Figure 42. Post-OCR Results of EX-1 using OmniPage® (after [2]) 

Figure 43. Post-OCR Results of EX-2 using OmniPage® (after [2]) 

MCO .SO 10. IE 


TABLE 2. COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS 

QUANTITY PER 

DODIC NOMENCLATURE_MEUtSOCi 

AOll CTG, 12 GA 00 BUCK 4S0 

A014 CTG, 12 GA 7.5 SHOT 4S0 

A023 CTG, 12 GA SLUG 250 

A024 CTG, 12 GA LOCKBUSTER 300 

A136 CTG, 7.62MM MATCH 460 

A260 CTG, 9MM OHP 18,000 

A363 CTG, 9MM BALL 18,000 

A475 CTG, .45 CAL BALL 6,000 

AX14 SHOTGUN PRIMER 1,000 

DtjBS DIVERSIONARY CHARGE MK 141 MOD 0 500 

L302 SIG CTG, WHITE FLARE 350 

L304 SIG CTG, GREEN FLARE 350 

L32S SIG CTG, RED FLARE 350 

LKll SIGNAL ROCKET LAUNCHER KIT 50 

M031 CHG, DEMO, BLOCK, TNT 1/2 LB 192 

M032 CHG, DEMO, BLOCK, TNT 1 LB 96 

MO 3 9 CHG , DEMO, CRATERING 20 

M130 CAP, BLAST ELEC 288 

M131 CAP, BLAST NON-ELEC 480 

M456 CORD, DETONATING, REIN 2,000 FT 

M670 FUZE, BLASTING, TIME, EXPLOSIVE LOADED 800 FT 

M766 IGNITER, TIME. BLASTING FUZE 300 

M960 CHG , DEMO , EXPLOST/E SHEET 38 FT PER ROLL 2 RO 

M981 CHG, DEMO, EXPLOSIVE SHEET 25 FT PER ROLL 2 RO 

M9S2 CHG, DEMO, EXPLOSIVE SHEET 19 FT PER ROLL 2 RO 

ML03 FIRING DEVICE, DEMO, MULTIPURPOSE Ml24 112 

MM30 CHG, DEMO 20 GRAMS 200 

MM41 CHG, DEMO, SHAPED, FLEX LIN 30 GR/FT, 6 FT LEN-GIH 12 

MM42 CHG, DEMO, SHAPED, FLEX LIN 40 GR/FT, 6 FT LENSTH 32 

MM43 CHG, DEMO, SHAPED, FLEX LIN 60 GR/ET, 6 FT LEN-GTH 12 

MM44 CHG, DEMO, SHAPED, FLEX LIN 75 GR/FT, 6 FT LEN-GTH 28 

MM45 CHG, DEMO, SHAPED, FLEX LIN 100 GR/FT, 6 FT LENCTH 12 

MM46 CHG, DEMO, SHAPED, FLEX LIN 225 GR/FT, 6 FT LEN-GTH 12 

MM47 CHG, DEMO, SHAPED, FLEX LIN 400 GR/FT, 6 FT LENGTH 12 

MM48 CHG, DEMO, SHAPED, FLEX LIN 600 GR/FT, 6 FT LENCTH 12 

MM5 6 DUAL LEAD NONEL PRIMADET, 175 MS DELAY, 10 0 FT LENGTH 7 5 
MN14 FIRING DEVICE HAND HELD MK 54 5 


ENCLOSURE [5} 

Figure 44. Post-OCR Results of EX-3 using OmniPage® (after [2]) 

MCO eOlD.IE 


ARTILLERY ANCILLARY ITEMS 


APR 15 1997 


PHn.l FCTM F 

jLur'ii 1 anv jtpu 

IXJDIC 

Nome n-c latu re 

DO Die 

Nomenclature 

Mulitplier 

D501 

PROJ 155HH, ADAH-L, H692 

D533 

CHG PROP 155HH, RB.'WB, M119A1; 

0.27 



0540 

CHG PROP 155HH, GREEN BAG, H3 

0.15 



D541 

CHG PROP 155HH, WHITE BAG, H4 

0.6S 



N233 

FUZE, ET, M762 

1.05 




(N2S5, FZ, MTSq, H577 may Bubstitute) 




N523 

PRIMER, PERCUSSION, M82 

1.1 

D502 

PROJ 155HH, ADAH-S, H731 

D533 

CHG PROP 155MM, RB'WB, M119A1; 

0.27 



D540 

CHG PROP 155MM, GREEN BAG, M3 

0.15 



0541 

CHG PROP 155MM, WHITE BAG, M4 

0.6S 



N23S 

FUZE, ET, M762 

(N2S5, FZ, MTSq, M577 may Bubstitute) 

1.05 



N523 

PRIMER, PERCUSSION, M82 

1.1 

0505 

PROJ 155MM, IILLUM, M435A2 

0533 

CHG PROP 155MM, RB^WB, M11SA1I 

0.27 



0540 

CHG PROP 155MM, GREEN BAG, M3 

0.15 



0541 

CHG PROP 155MM, WHITE BAG, M4 

0.03 



N2SS 

FUZE, ET, M762 

(N2S5, FZ, MTSq, M577 may substitute) 
(N24S, FZ, MT, M565may substitute) 

1.05 



N523 

PRIMER, PERCUSSION, MS2 

11 

D510 

PROJ 155MM, COPPERHEAD, H712 

0533 

CHG PROP 155MM, RB/WB, M119A1; 

0.27 



0540 

CHG PROP 155MM, GREEN BAG, M3 

0.15 



D541 

CHG PROP 155MM, WHITE BAG, M4 

0.03 



N523 

PRIMER, PERCUSSION, Me2 

1.1 

D514 

PROJ 155HH, RAAH-L 

D533 

CHG PROP 155MM, RB'WB, M119A1; 

0.27 



D540 

CHG PROP 155MM, GREEN BAG, M3 

0.15 



D541 

CHG PROP 155MM, WHITE BAG, m 

0.03 



N239 

FUZE, ET, M762 

(N2S5, FZ MTSq, M577 may substitute) 

1.05 



N523 

PRIMER. PERCUSSION. MS2 

1.1 

D515 

PROJ 155HH, RAAH-S 

D533 

CHG PROF 155MM, RB'WB, M119A1; 

0.27 



D540 

CHG PROP 155MM, GREEN BAG, M3 

0.15 



D541 

CHG PROP 155MM, WHITE BAG, M4 

0.03 



N2S9 

FUZE, ET, M762 

(N2S5, FZ, MTSq, M577 may substitute) 

1.05 



N523 

PRIMER, PERCUSSION, MB2 

11 

D52& 

PROJ 155HH, SMOKE, WP, MB25 

0532 

CHG PROP 155MM, RED BAG, M203 

0 2 



D533 

CHG PROP 155MM, RB/WB, M119A1; 

0.2 



D541 

CHG PROP 155MM, WHITE BAG, M4 

0.7 



N239 

FUZE, ET, M762 

(N2S5, FZ, MTSq, M577 may substitute) 

1.05 



N523 

PRIMER. PERCUSSION. MB2 

1 1 


ENCLOSURE (6) 

Figure 45. Post-OCR Results of EX-4 using OmniPage® (after [2]) 


Table 5 summarizes the accuracy rate and important findings for 
OmniPage®. 

Page     Total     Word      Number    Cause / Findings 
         Errors    Errors    Errors 

EX-1     0         0         0         • Recreated near-perfectly. 

EX-2     14        7         7         • Text converted from portrait view to 
                                         landscape. 
                                       • Data placed in a table data structure. 

EX-3     4         3         1         • Data placed in a table data structure. 

EX-4     5         2         3         • Near-perfect table recreation. 

Totals   23        12        11 



Table 5. OmniPage® Results and Findings 


In general, OmniPage® accurately transcribed each input file. In most cases, the 
application created fully modifiable outputs that were near-duplicates of the input files. 
While the program did encounter errors, no specific trends appeared. When an error was 
encountered, it was corrected with the proofreader. The application was able to detect and 
represent data in different views (landscape and portrait) and also created very accurate 
and well-defined tables. Based on the findings for this application, the following scores 
were given: 

• Accuracy: 3 

• Consistency: 3 

• Speed: 3 

• Ease of Use: 1 

• Functionality: 3 

OmniPage® had the highest accuracy rate of all the applications. It consistently 
produced the same results. The application's OCR speed and proofreader allow the user to 
quickly review and correct the document. Using and understanding the basic functionality 
of the program can be accomplished in approximately one hour. Understanding the 
advanced functionality of the program can be accomplished in 2-3 hours. The application 
has fewer input file types than OneNote®; however, it has the most output 
types of all three applications. 

6. OCR Summary 

Table 6 illustrates a summary of the scores given to all the applications. 



                 Accuracy  Consistency  Speed  Ease of Use  Functionality  Totals 

Onlineocr.net    2         3            1      3            1              10 

OneNote®         1         1            1      2            2              7 

OmniPage®        3         3            3      1            3              13 


Table 6. OCR Summary Scores 


In general, the OmniPage® software out-performed the other applications and 
received the highest score. Not only did it have the highest accuracy rate, it provided the 
most functionality—spellcheck, native text editor, and the most output formats. Based on 
these findings, OmniPage® was used to create the text files that were later used by the 
extraction programs. 

The second-best application was the open-source application. Although the 
application proved to be accurate, intelligent, and quickly learnable, it is limited by its 
page-per-hour restriction and number of input and output formats. 

OneNote® was least-favored because it produced highly inaccurate and 
inconsistent results. While the program provides spellcheck functionality and a vast array 
of input formats, it has limited output options and requires heavy user-involvement to 
correct the OCR outputs. 
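
The totals in Table 6 are simple unweighted sums of the five criterion scores. The short sketch below reproduces that arithmetic and the resulting ranking. The dictionary layout and the function name rank_applications are illustrative assumptions; only the scores themselves come from Table 6.

```python
# Criterion scores copied from Table 6, in the order:
# Accuracy, Consistency, Speed, Ease of Use, Functionality.
SCORES = {
    "Onlineocr.net": [2, 3, 1, 3, 1],
    "OneNote": [1, 1, 1, 2, 2],
    "OmniPage": [3, 3, 3, 1, 3],
}


def rank_applications(scores):
    """Return (name, total) pairs sorted from highest to lowest total."""
    totals = {name: sum(values) for name, values in scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for name, total in rank_applications(SCORES):
    print(name, total)
```

An unweighted sum treats every criterion as equally important. If accuracy were judged to matter more than ease of use, a weighted sum could be substituted without changing the rest of the sketch.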

B. PROGRAM DEMONSTRATION 

After the OCR comparison was conducted, two programs were created to extract 
the text-based consumption data from the text files produced by OmniPage®. The 
goal of the first program was to automate the process of data extraction by using 
pre-defined decision-making logic. While the program strives for automation during the 
text extraction phase, user interaction is necessary at the end to verify the outputs. The 
goal of the second program was to involve the user in every decision. Since it had no 
pre-defined logic statements, the responsibility for deciding whether a consumption data 
element was true and accurate was placed on the user. These programs were created with 
one assumption: the input document was in an acceptable format for the application and 
free of errors. Thus, the programs were given “perfect inputs,” which allows the testing to 
focus solely on data extraction. 

The automated program was tested first in a two-phase process. During the first 
phase, each page was examined separately to determine what unique characteristics 
existed to distinguish desired elements from superfluous information. Once the unique 
characteristics (if any) had been identified, the program was written to detect them and 
successfully extract their contents. Some of the input documents followed regulated 
correspondence procedures, allowing the leveraging of some of their suitable 
characteristics. During the second phase, all five of the input documents were placed into 
one document for the application to process as a whole. This tested the ability of the 
automated application to work as designed. After the automated application was tested, 
the walkthrough application was designed and tested as two versions: line-by-line and 
page-by-page. 


1. Automated Program 

a. File Input and Closing 

In order to begin extracting consumption data out of the input files, the program 
first opens an input file, reads each line into a list, and then closes the input file. While 
this part of the program produces no output, it handles necessary application overhead to 
start the process. Figure 46 illustrates the coding of the file open and close function. 

# This module handles opening of the input file, 
# reading in of inputs, and closing of the file. 

rawData = []        # Holds the original lines 
lineCount = 0 
rawDataSize = 0 


def fileOpenRoutine(fileName): 
    global lineCount 
    global rawDataSize 
    lineCount = 0 
    rawDataSize = 0 
    file = open(fileName, 'r')          # Open the file 
    for line in file: 
        line = line.strip()             # Strip white space 
        if (line != ""):                # Remove blank lines 
            rawData.append(line)        # Add it to the list 
            lineCount = lineCount + 1 
            rawDataSize = rawDataSize + 1 
    file.close()                        # Close the input file 


#-Main-# 

print("Input filename to analyze:") 
fileName = input() 
fileOpenRoutine(fileName) 


Figure 46. Automated Program: File Open and Close 


The program begins by asking the user to input a filename. Once the input has 
been given, the program calls the first function: fileOpenRoutine(fileName). This function 
takes the name of a file entered by the end user, opens it, reads in each line from the file, 
records the number of lines, and closes the file. To aid later data extraction and 
readability, white space is stripped off the beginning and end of each line and blank 
lines are removed. 
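
As a point of comparison, the same read-strip-filter behavior can be written more compactly with a context manager, which closes the file even if an error occurs while reading. This is a minimal sketch rather than the program used in this thesis; the function name read_nonblank_lines is an assumption introduced for illustration, and the line count is left to the caller via len().

```python
def read_nonblank_lines(file_name):
    """Return the stripped, non-blank lines of a text file as a list."""
    # The with-statement closes the file automatically on exit.
    with open(file_name, "r") as handle:
        stripped = (line.strip() for line in handle)
        # Keep only lines that still contain text after stripping.
        return [line for line in stripped if line]
```

Returning the list instead of updating module-level variables keeps the function free of global state, so it can be called once per input document without resetting counters.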

b. Detect and Extract Document Information 

Now that the input file has been read, the program begins by identifying and 
storing the important identifying information of each document: name, date of 
publication, subject, etc. Figure 47 illustrates the coding of this function. 


identifierInfo = []     # Identifier Information 
counter = 0 


def handleMarineCorpsOrderIdentifyingInformation(rawDataList): 
    counter = 9 
    subj = "" 
    identifierInfo.append(rawData[6])    # Line 6 should be Marine Corps Order and # 
    identifierInfo.append(rawData[5])    # Line 5 should be the date 
    while "Ref:" not in rawData[counter]:    # Subj starts on line 9 and ends when it finds Ref 
        if (subj == ""): 
            subj = rawData[counter] 
        else: 
            subj = subj + " " + rawData[counter]    # Process a multi-line subject 
        counter = counter + 1 
    identifierInfo.append(subj) 


#-Main-# 

if "MCO" in rawData[3]:     # We look for the indication of a Marine Corps Order 
    handleMarineCorpsOrderIdentifyingInformation(rawData) 


Figure 47. Automated Program: Detect and Extract Document Info 


This part of the program begins with a logic test. The statement “if ‘MCO’ in 
rawData[3]” checks whether the string “MCO” is present in line three of the document 
(index 3 of rawData, counting from zero after blank lines have been removed). Due to the 
template-based nature of the document, the test is passed, and the program executes the 
function that handles a MCO: handleMarineCorpsOrderIdentifyingInformation(rawDataList). 
It is important to note that the input document was written in 1997 and may not represent 
the format of a current MCO. Thus, it is important for a final and fully-implemented 
application to follow the most current standards and policies. 

The function handleMarineCorpsOrderIdentifyingInformation(rawDataList) 
handles the identification of the document name, the date it was published, and the 
subject of the document. Since a MCO follows standard correspondence procedures and the 
program was given perfect input, these data fields were extracted from the document by 
using their exact line numbers. For example, MARINE CORPS ORDER 8010.1E resides on 
line six, the date of the document resides on line five, and the subject of the document 
begins on line nine and ends when the first occurrence of “Ref:” is encountered. It is 
common for the subject to span several lines, requiring the program to link 
(concatenate) the lines together in order to accurately present the subject field. Testing for 
“Ref:” also represents one of the first major problems with implementing an automated 
program. Testing for the presence of this exact string had to be done in order to stop the 
program from entering an endless loop. Since the program was given perfect inputs, this 
was not a problem. However, if the user who verifies the OCR output makes an error and 
allows a different string such as “ref:” to go through, it would cause this particular 
program to crash. 
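
One way to soften the dependence on the exact string “Ref:” is to compare case-insensitively and to bound the search by the length of the input. The sketch below illustrates that idea only; the function name extract_subject, the default start line, and the choice to raise an exception when no marker is found are assumptions introduced here and are not taken from the program in Figure 47.

```python
def extract_subject(lines, start=9, marker="ref:"):
    """Join lines from `start` up to the first line beginning with `marker`.

    The comparison ignores case, and the loop cannot run past the end of
    the input because it iterates over a slice of the list.
    """
    parts = []
    for line in lines[start:]:
        if line.lower().startswith(marker):
            return " ".join(parts)
        parts.append(line)
    # No marker was found: report the problem instead of looping or
    # indexing past the end of the list.
    raise ValueError("no reference marker found after the subject line")
```

With this structure a missing or misspelled marker surfaces as a single, explicit error that a reviewer can act on, which fits the user-verification step described above.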

While this function handles the detection and extraction of the identifying 
information for the front page of a MCO, it can be used as a template function to handle 
other documents: field manuals, technical manuals, etc. Figure 48 illustrates the output of 
this function once the program successfully detected and extracted the document’s 
identifying information. 


Input filename to analyze: 
pagel.txt 

Document: MARINE CORPS ORDER 8010.1E 
Date: 15 Apr 97 

Subj: CLASS V(W) PLANNING FACTORS FOR FLEET MARINE FORCE COMBAT 
OPERATIONS 


Figure 48. Detect and Extract Document Info Output 

c. Detect and Extract Table Information 


This part of the program is responsible for detecting tables in the documents and 
extracting their contents. Figure 49 illustrates the coding of this part of the program 
(v1.0), which tests for the presence of the “Infantry-Heavy Threat Combat Planning 
Factors Table.” In the example given, the function handleDataStructures(rawData) is 
called by the program after the file has been read and the document’s identifying 
information has been recorded. By studying the input document, we know that the string 
“Infantry-Heavy Threat Combat Planning Factors Table” represents a table inside the 
document that contains lines of consumption data. In order to begin data extraction from 
the table, a logical test is conducted first to see if the table is present in the document: “if 
‘Infantry-Heavy Threat Combat Planning Factors Table’ in rawData[counter]” is tested 
on each line from the input document. After the program detects the table, it reads lines 
of data into an intermediate data structure, “tableData,” until the next occurrence of the 
word “Table” is encountered. This had to be done for the same reason for which the 
occurrence of “Ref:” was tested. Due to the current format of the input document, no 
distinguishable landmarks existed for the program to stop. The only way to get the 
program to stop was by testing for the presence of a new table. Again, should the word 
“Table” not exist or be spelled incorrectly, the program would enter an endless loop, 
requiring user intervention. 

By reading the table contents into the secondary data structure “tableData,” we can 
further isolate the inputs for extraction. Another counter, “tmpCounter,” is used to 
prevent the main program counter, which controls position, from being adjusted, saving 
the correct position in the main program until the secondary extraction process has 
completed. Further leveraging known information from the input document, we know 
that the DODIC has a length of five characters. Thus, the logical test 
“if(len(tableData[tmpCounter]) == 5)” is used to indicate when a new line of data is 
encountered. Once a new line of data is encountered, the previous line of data, 
“dataLine,” is printed to the screen. Note: some of the logic tests created for this program 
may cause false positives. For example, if a consumption data element is five characters 
long and is not a DODIC, the program would attempt to extract the data based on a false 
positive. However, using these tests became necessary because no other distinguishable 
tests could be created based on the format of the input document. Thus, in order to reduce 
the complexity of the code and minimize false positives, new standards for the input 
documents may be needed. Furthermore, while the logic test checks for the presence of 
the “Infantry-Heavy Threat Combat Planning Factors Table,” testing can be done on 
other inputs. For example, placing “Infantry-Heavy Threat Combat Planning Factors 
Table,” “Armor-Heavy Threat Combat Planning Factors Table,” and “Composite 
Combat Planning Factors Table” in a list data structure would allow the program to test 
repeatedly for table existence. By using this approach, a “consumption data dictionary” that 
contains known occurrences of table names could be created. Over time, this dictionary 
could track all known occurrences of tables and provide template-based formatting for 
their extraction. For example, if the program were to recognize and detect the presence of 
an “ammunition table” and have a pre-defined understanding that this table was 10 lines 
of data, the program could locate the table and extract 10 lines of data. This would help 
alleviate some of the hard-coded complexities in previous examples. Figure 50 illustrates 
the output of this function (v1.0). 
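
The “consumption data dictionary” idea can be sketched as a mapping from known table names to the number of data lines each table is expected to contain. The following is a minimal illustration under stated assumptions: the names KNOWN_TABLES, find_known_tables, and extract_fixed_rows are hypothetical, the row counts are placeholders (None meaning the length is not known in advance), and the “Ammunition Table” entry stands in for the hypothetical 10-line table mentioned above.

```python
# Hypothetical dictionary: table name -> expected number of data lines.
KNOWN_TABLES = {
    "Infantry-Heavy Threat Combat Planning Factors Table": None,
    "Armor-Heavy Threat Combat Planning Factors Table": None,
    "Composite Combat Planning Factors Table": None,
    "Ammunition Table": 10,  # placeholder entry for illustration only
}


def find_known_tables(lines, known_tables):
    """Return (line_index, table_name) for each known table name found."""
    hits = []
    for index, line in enumerate(lines):
        for name in known_tables:
            if name in line:
                hits.append((index, name))
                break  # at most one table name is recorded per line
    return hits


def extract_fixed_rows(lines, header_index, row_count):
    """Return the `row_count` lines that follow the table header line."""
    first = header_index + 1
    return lines[first:first + row_count]
```

A caller would look up the expected row count for each hit and, when it is known, pass it to extract_fixed_rows; tables whose length is unknown would still need a stopping test such as the ones discussed above.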

def handleDataStructures(rawData): 
    global rawDataSize 
    counter = 0 
    tmpCounter = 0 
    tableData = [] 
    tableDataSize = 0 
    dataLine = "" 

    while (counter < rawDataSize): 
        if "Infantry-Heavy Threat Combat Planning Factors Table" in rawData[counter]: 
            counter = counter + 18      # Move past table header and column headers 
            while "Table" not in rawData[counter]: 
                tableData.append(rawData[counter]) 
                tableDataSize = tableDataSize + 1 
                counter = counter + 1 
                if (counter == rawDataSize):    # There are no more inputs, prevent 
                    break                       # reaching out of bounds 

            while (tmpCounter < tableDataSize):     # Construct each line of data 
                if (dataLine == ""): 
                    dataLine = tableData[tmpCounter] 
                    tmpCounter = tmpCounter + 1 
                while (True): 
                    if (tmpCounter >= tableDataSize): 
                        break 
                    dataLine = dataLine + " " + tableData[tmpCounter] 
                    tmpCounter = tmpCounter + 1 
                    if (tmpCounter >= tableDataSize): 
                        break 
                    if (len(tableData[tmpCounter]) == 5): 
                        print(dataLine) 
                        dataLine = "" 
                        tmpCounter = tmpCounter - 1 
                        break 
                tmpCounter = tmpCounter + 1 
        counter = counter + 1 

Figure 49. Automated Program: Handle Data Structures (v1.0) 

Input filename to analyze: 

pagell.txt 

Found Table: 

Infantry-Heavy Threat Combat Planning Factors Table 
Known column headers for this table: 

Weapon Ammunition GCE RATES Other than GCE Rates 

Weapon ID Nomenclature DODIC Nomenclature Daily Daily Basic Daily Daily Basic 

ASSAULT SUSTAIN Allowance ASSAULT SUSTAIN Allowance 
B0471 SQUAD DEMOLITION SET M032 CHARGE, DEMO BLOCK 1 LB TNT 15.00753 3.39605 48 15.00753 
B0471 SQUAD DEMOLITION SET M130 CAP, BLASTING ELECTRIC 18.01945 4.00000 150 18.01945 2.03084 
B0471 SQUAD DEMOLITION SET M131 CAP, BLASTING NON-ELECTRIC 60.00000 36.00000 260 60.00000 
B0471 SQUAD DEMOLITION SET M456 CORD, DETONATING PETN 1401.30498 340.45345 1500 
B0471 SQUAD DEMOLITION SET M670 FUZE, BLASTING TIME 500.00000 216.00000 3000 380.00000 
B0471 SQUAD DEMOLITION SET M757 CHARGE, ASSEMBLY DEMOLITION 15.00753 3.39605 50 15.00753 


Figure 50. Handle Data Structures Output (v1.0) 


To illustrate how new policy standards can help reduce the amount of coding and 
overall complexity of the program, two landmarks were inserted before and after the 
table. The phrases “Begin Table” and “End Table” were placed in the input document as 
wrappers around the table. The program was then modified to detect these phrases and 
conduct data extraction. These changes are illustrated in Figure 51. 


def handleDataStructures(rawData):
    global rawDataSize
    counter = 0
    tableData = []
    tableDataSize = 0
    while (counter < rawDataSize):
        if "Begin Table" in rawData[counter]:
            while "End Table" not in rawData[counter]:
                tableData.append(rawData[counter])
                tableDataSize = tableDataSize + 1
                counter = counter + 1
        counter = counter + 1  # Advance to the next line of input

Figure 51. Automated Program: Handle Data Structures (v2.0) 


By placing these phrases into the input file, we were able to drastically reduce the 
complexity of the program. Rather than creating programs that must handle very specific 
tests such as detecting the string “Infantry-Heavy Threat Combat Planning Factors 
Table,” refinement of the input allows more general usage, regardless of table name and 
input document type. 
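
The landmark approach described above can be sketched in a more general form. The following is a minimal illustration, not the program used in this thesis: the function name extract_wrapped_tables and the sample lines are hypothetical, and, unlike Figure 51, the sketch drops the landmark lines themselves and tolerates a missing “End Table” by keeping whatever was collected.

```python
def extract_wrapped_tables(raw_data, begin="Begin Table", end="End Table"):
    """Return one list of lines per table wrapped in begin/end landmarks."""
    tables = []
    current = None  # None means we are currently outside any table
    for line in raw_data:
        if current is None:
            if begin in line:
                current = []  # start collecting a new table
        elif end in line:
            tables.append(current)  # table is complete
            current = None
        else:
            current.append(line)
    if current is not None:
        # Missing end landmark: keep what was collected instead of failing
        tables.append(current)
    return tables


# Hypothetical sample input, loosely modeled on Figure 50
sample = [
    "Enclosure (1)",
    "Begin Table",
    "B0471 M032 15.00753",
    "B0471 M130 18.01945",
    "End Table",
    "Some narrative text",
]
tables = extract_wrapped_tables(sample)
```

Because the loop iterates over the input directly, no table name or line offset has to be hard-coded, and the same function works for any number of wrapped tables in one document.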

d. File Output 

This part of the program focuses on placing the consumption outputs into a file. 
While this program focuses on output to a file, the output produced can be interpreted 
as a (key, value) pair. This pair would take the following form: a string key 
consisting of the document’s identifying information and a list value consisting 
of each line of consumption data. Thus, it would be (string identifyingInformation, list 
consumptionDataLines). Although this program concentrates on writing outputs to a file 
at the end of the document’s processing, the outputs could be written to a file as the 
program works through each consumption table. For example, when a table is 
encountered, the entire table is written to the file, the data structure that holds the table 
information is then cleared, and the process is free to move to the next table, allowing 
repetitive use of the intermediate data structure. Figure 52 illustrates the coding of this 
function. 


def writeToFile(fileName, table):

    outputFile = open(fileName, "w")

    for element in table:
        outputFile.write(element + "\n")

    outputFile.close()  # Make sure the buffered lines reach the file

writeToFile("outputs.txt", table)


Figure 52. Automated Program: File Output 

While this function produces no visual output, it places all of the contents in the 
provided table into the provided filename. The contents of this table would consist of all 
the consumption elements extracted from the input document. 
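
The (key, value) interpretation and the table-by-table writing strategy described in this section can be sketched as follows. This is an illustration under stated assumptions, not the thesis program: the helper names write_table and as_key_value are hypothetical, the tab-separated key prefix is only one possible layout, and an in-memory buffer stands in for the output file.

```python
import io


def as_key_value(identifying_information, table):
    """Return the (key, value) form: a string key and a list of data lines."""
    return (identifying_information, list(table))


def write_table(output_file, identifying_information, table):
    """Write one table to an open file, then clear the table for reuse."""
    for element in table:
        # Assumed layout: key, a tab, then the consumption data line
        output_file.write(identifying_information + "\t" + element + "\n")
    table.clear()  # the intermediate structure is free for the next table


buffer = io.StringIO()  # stands in for open("outputs.txt", "w")
table = ["B0471 M032 15.00753", "B0471 M130 18.01945"]
pair = as_key_value("MCO 8010.1E", table)
write_table(buffer, "MCO 8010.1E", table)
```

Copying the list inside as_key_value keeps the pair intact after the shared table structure is cleared for the next table.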
e. Phase Two Test 

In order to test the automated program for phase two, all of the input files were 
consolidated into one file named “consolidatedinputs.txt.” The program was slightly 
modified to search for all occurrences of the tables inside of the consolidated input file. 
Again, additional tests had to be created to work around false positives caused by 
the format of the input document. The exact string “Infantry-Heavy Threat Combat 
Planning Factors” was found on the first page of the input document in the enclosure 
section. The program began to extract data from this point, all of which was incorrect. 
The logic test “if ((counter > 90) and (“Infantry-Heavy Threat” in rawData[counter]))” 
had to be created in order to access the table at the correct position. This could 
have been avoided by giving the program only the sections of the document that 
contained the tables. However, this requires more user interaction. Additionally, the 
problem of determining when a table had ended presented itself again. In order to stop 
extracting data for a particular table, the program had to check for the presence of the 
next table. Using the “Begin Table” and “End Table” changes, as suggested earlier, 
would have made these tests unnecessary. Should the next table not exist or its name 
be misspelled, the program would never find the end of the current table and would 
fail once it ran past the end of the input. 
Figure 53 illustrates a program that was capable of extracting consumption data elements 
from all the tables in the consolidated input file. 
table1 = []
table2 = []
table3 = []

while (counter < rawDataSize):

    if ((counter > 90) and ("Infantry-Heavy Threat" in rawData[counter])):
        table1.append(rawData[counter])
        counter = counter + 1

        while "COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS" not in rawData[counter]:
            table1.append(rawData[counter])
            counter = counter + 1

    if "COMBAT PLANNING FACTORS FOR SPECIAL OPERATIONS" in rawData[counter]:
        table2.append(rawData[counter])
        counter = counter + 1

        while "ARTILLERY ANCILLARY ITEMS" not in rawData[counter]:
            table2.append(rawData[counter])
            counter = counter + 1

    if "ARTILLERY ANCILLARY ITEMS" in rawData[counter]:
        table3.append(rawData[counter])
        counter = counter + 1

        while len(rawData[counter]) > 0:
            table3.append(rawData[counter])
            counter = counter + 1

            if counter >= rawDataSize:
                break

    counter = counter + 1


Figure 53. Automated Program: Handle Consolidated Input File 


f. Automated Program Summary 

The automated program takes only one user input to begin processing: a 
filename. Once the filename has been entered, the program executes automated analysis 
of the file using pre-built logic tests. Once the program has finished its analysis, the user 
must verify each data element before it is written to the output file. This part of the 
program is not illustrated because it is very similar to the program presented in the next 
section. 
While the goal of this program is to achieve automation, it requires extensive 
testing and logic creation, and it still requires the user to review the results once 
the program has autonomously extracted all the possible consumption data elements. 
Some of the logic tests used by this program would be unnecessary with a 
better-structured input and may ultimately cause the program to fail. For example, the 
test “if ((counter > 90) and (“Infantry-Heavy Threat” in rawData[counter]))” remains 
necessary unless the user selects the input data or the input document is refined. While the 
goal of creating these tests was to increase the accuracy of the output, they increase the 
complexity and length of the code, require additional processing power, and may cause 
the program to run more slowly or less efficiently. Additionally, they may only work 
with very specific inputs. In order to mitigate this problem, refinement of the input 
document is necessary. By creating unique identifiers such as “Begin Table” and “End 
Table,” the complexity and length of the program can be reduced while also allowing it to 
accept a larger variety of inputs. 

2. Walkthrough Programs 

While the automated program makes use of several functions, this program only 
uses one. Instead of extracting the document’s identifying information and consumption 
elements separately, this program allows the user to step through each line of the input 
document in sequential order, prompting the user to keep the line or disregard it. Thus, 
identifying information and consumption elements no longer need to be processed 
separately; both can be handled in the same pass since both come through as lines of input. 
Using this approach, the input file can be analyzed line-by-line or by another increment: 
paragraph-by-paragraph, page-by-page, etc. Therefore, two approaches using this 
application are presented: line-by-line and page-by-page. As a reference point, a page is 
defined as 45 lines of input. The decision to use 45 lines is based on the standard 
MS Word® document format. A one-page document with one-inch margins can hold 
45 lines of information in Times New Roman, 12-point font. 
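
The 45-line page definition can be expressed as a small helper that splits the input into pages. This is a minimal sketch; the names paginate and LINES_PER_PAGE are hypothetical and do not appear in the thesis programs.

```python
LINES_PER_PAGE = 45  # page size used as the reference point in this thesis


def paginate(raw_data_list, lines_per_page=LINES_PER_PAGE):
    """Yield successive pages, each a list of at most lines_per_page lines."""
    for start in range(0, len(raw_data_list), lines_per_page):
        yield raw_data_list[start:start + lines_per_page]


# 100 placeholder lines produce two full pages and one partial page
pages = list(paginate(["line %d" % n for n in range(100)]))
```

Keeping the page size as a parameter allows other increments, such as paragraph-sized or single-line pages, without changing the walkthrough logic.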
a. Line-by-line Program 

First, the input file is opened and each line of the document is read into a 
temporary data structure. Afterwards, the user is presented with one line of information at 
a time and is prompted to verify whether or not it is a data element. If, and only if, the 
user enters “yes,” the element gets placed into the final output data structure. Once the 
document has been reviewed and no inputs remain, the data structure that contains all the 
“yes” responses is then placed into an output file. Figure 54 illustrates the coding of the 
line-by-line program. Figure 55 illustrates a snapshot of the running application. 


validInformation = []  # Store the final data elements

def manualWalkthrough(rawDataList):
    for element in rawDataList:
        print ("Is this valid information?")
        print (element)
        response = input()
        if (response == "yes"):
            validInformation.append(element)  # If yes, add it to the list

manualWalkthrough(rawData)  # Call the function


Figure 54. Line-by-line Program (coding) 


Is this valid information? 

DEPARTMENT OF THE NAVY 
yes 

Is this valid information? 

HEADQUARTERS UNITED STATES MARINE CORPS 
yes 

Is this valid information? 

WASHINGTON, DC 20380-0001 
no 

Is this valid information? 

MCO 8010.1E 


Figure 55. Line-by-line Program (running) 

The main strength of this approach is its simplicity. The main weakness of this 
approach is its lack of functionality. As long as there is only one valid consumption 
element per line in the input document, an end-user can quickly and accurately walk 
through the document. However, if the consumption data element is split apart and spread 
across multiple lines, the program output may not make sense. In order to correct this, the 
input document must be further processed or a string concatenation procedure must be 
created. While the program can quickly walk through each line, the end user may find it 
faster to view multiple lines of information at once with the ability to select specific lines 
or ranges. Thus, the page-by-page program offers a potential performance increase. 
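
One possible form of the string concatenation procedure mentioned above is sketched below. It rests on a loud assumption that is not taken from the thesis programs: every record is assumed to begin with an identifier of one capital letter followed by four digits (such as B0471 in Figure 50), and any other line is treated as a continuation. The function name join_split_records is hypothetical.

```python
import re

# Assumption: a new record starts with an ID such as "B0471"
# (one capital letter followed by four digits).
RECORD_START = re.compile(r"^[A-Z]\d{4}\b")


def join_split_records(lines):
    """Merge continuation lines into the record that precedes them."""
    records = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines
        if RECORD_START.match(text) or not records:
            records.append(text)  # start of a new record
        else:
            records[-1] = records[-1] + " " + text  # continuation line
    return records


split_input = [
    "B0471 SQUAD DEMOLITION SET M032",
    "CHARGE, DEMO BLOCK 1 LB TNT 15.00753",
    "B0471 SQUAD DEMOLITION SET M130",
    "CAP, BLASTING ELECTRIC 18.01945",
]
records = join_split_records(split_input)
```

A continuation line that happens to begin with a token of the same shape would be mistaken for a new record, so the pattern would need to be adjusted to each document type.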

b. Page-by-page Program 

This program is similar to the line-by-line program but aims to speed up the 
review process. Figure 56 illustrates the coding of this program. 


def manualWalkthrough(rawDataList):
    counter = 1
    tmpCounter = 0
    while (True):

        print ("Please select the lines of valid information by line #:")
        print ("Separate line #'s with a space - e.g. 1 3 15 31 44")

        pageStart = tmpCounter  # Index of the first line shown on this page
        while ((counter < 46) and (tmpCounter < rawDataSize)):
            print(str(counter) + "." + " " + rawDataList[tmpCounter])
            tmpCounter = tmpCounter + 1
            counter = counter + 1

        response = input()

        responses = response.split()  # Split the inputs by their space
        for element in responses:
            # Map the displayed line # back to its position in the input
            validInformation.append(rawDataList[pageStart + int(element) - 1])
        counter = 1

        if (tmpCounter == rawDataSize):
            print ("Review complete.")
            break

manualWalkthrough(rawData)  # Call the function


Figure 56. Page-by-page Program (coding) 
The main strength of this approach is its speed. The main weakness is its need for 
structured and accurate input by the user. The program begins by prompting the user to 
enter specific line numbers that are deemed to be elements of consumption data. Before 
the user is able to enter input, 45 lines of data are presented to the user for review. The 
user must enter structured input (line numbers separated by white space) that corresponds 
to each correct line. Once the user input has been tied to corresponding lines, the 
respective lines are transferred to the final output data structure. Once all of the lines have 
been reviewed, the program terminates. An extension of the program would be the ability 
to go back through the final output list and modify and review the values as a second 
layer of precaution. While the goal is to speed up the process, it places more 
responsibilities and requirements on the end user. Unless failure logic is added, 
inappropriate responses or errors in the input will cause the program to fail or raise 
exceptions. Figure 57 illustrates a running instance of the program. 


Please select the lines of valid information by line #: 

Separate line #'s with a space - e.g. 1 3 15 31 44 

1. DISTRIBUTION STATEMENT A: Approved for public release; distribution 

2. is unlimited. 

3. MCO 8010.1E 
4. 15 Apr 97 

5. can be used in conjunction with MAGTF II to determine Class V(W) 
requirements 

6. for operational plans. 

1 2 3 4 

Output: 

1. DISTRIBUTION STATEMENT A: Approved for public release; distribution 

2. is unlimited. 

3. MCO 8010.1E 
4. 15 Apr 97 


Figure 57. Page-by-page Program (running) 
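
The failure logic mentioned in this section could take the following form. This is a sketch under stated assumptions rather than part of the thesis program: the function names parse_selection and select_lines are hypothetical, and the sketch rejects a response outright instead of prompting the user again.

```python
def parse_selection(response, lines_shown):
    """Convert a response such as "1 3 15" into zero-based line indexes.

    Raises ValueError when a token is not a number or is out of range.
    """
    indexes = []
    for token in response.split():
        try:
            number = int(token)
        except ValueError:
            raise ValueError("not a line number: " + token) from None
        if number < 1 or number > lines_shown:
            raise ValueError("line number out of range: " + token)
        indexes.append(number - 1)  # displayed numbers start at 1
    return indexes


def select_lines(page, response):
    """Return the lines of one page that the response selects."""
    return [page[index] for index in parse_selection(response, len(page))]


page = ["DISTRIBUTION STATEMENT A", "is unlimited.", "MCO 8010.1E", "15 Apr 97"]
chosen = select_lines(page, "1 3")
```

A caller could catch the ValueError, display its message, and present the same page again, so that a mistyped response does not end the review.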


3. Program Summary 

The first program concentrates on automated analysis with user interaction at the 
end of the automated data processing. While this program strives to achieve automation, 
additional consumption document analysis and logic test creation are required. The 
creation of complex logic tests can be mitigated by further refinement of the input 
document. The second program concentrates on walkthrough analysis using two 
approaches: line-by-line and page-by-page. While the line-by-line program offers 
simplicity in its current state, it lacks functionality and may be slower than the second 
approach. The page-by-page approach allows 45 lines of information to be reviewed at 
a time, offering speed at the cost of simplicity and added reliance on user correctness. 
Since consumption data may be “buried” at the end of a consumption 
document, this approach allows the end-user to quickly reach their desired position in the 
document. Regardless of which program is used, the emphasis remains on correcting the 
input document and ensuring that it is free of errors and in the most suitable format. This 
can be accomplished through multiple layers of internal review and the establishment of 
new standards that would govern all future consumption documents. 
V. CONCLUSIONS 


A. LESSONS LEARNED 

1. Choosing OCR Application 

Chapter II discussed two OCR approaches: free-form and template-based 
recognition. Due to the nature of the consumption documents reviewed by this thesis, 
free-form analysis was chosen over template-based. Although template-based OCR can 
be faster than free-form, no feasible template opportunities presented themselves. 
Additionally, template-based OCR requires a new template be created for every instance 
of a table. Thus, free-form OCR is the most-preferred method for the current state of 
consumption documents. Should consumption data be presented in a template format in 
future standards or documents, template-based OCR may present itself as an opportunity. 

Chapter IV compared three OCR applications: an open-source, online application, 
Microsoft OneNote®, and Nuance OmniPage®. Of the three applications, the 
OmniPage® software offered the most-reliable functionality and the highest accuracy 
rate and is therefore recommended over the other applications. 

OmniPage® allowed for a simultaneous conversion and correction process, 
removing the requirement to separately correct the input document in another text editor 
after it had been converted. The open-source, online application was more accurate and 
more robust than OneNote®. Although the open-source application is advertised as free, the 
number of pages that can undergo OCR is limited unless additional pages are purchased. 
Thus, while it may not be feasible to use the online application, the OCR conversion 
process could be outsourced at first until native OCR capability is acquired or created. 
OneNote® proved to be unreliable, creating numerous spelling errors while not providing 
intelligent functionality for creating tables or data structures. 

2. Choosing Application Approach 

Chapter IV compared two programs created to extract consumption information 
out of input documents: automated and walkthrough analysis. The walkthrough analysis 
program was further subdivided into two approaches: line-by-line and page-by-page. 
While the automated program was able to autonomously extract consumption elements 
out of the input document, highly complicated and restrictive parsing logic had to be 
created. Creation of this logic requires that the programmer understand and be familiar 
with the nature of the input document. Additionally, a programmer would need to see an 
example of every consumption document to ensure that they had correctly defined all the 
logic statements. Furthermore, creation of these restrictive tests should be minimized to 
ensure the application could be used in a wide range of environments. While the 
template-based nature of a consumption document such as an MCO allows the program to 
accurately extract the document’s identifying information, it can also create problems that 
must be addressed. For example, consumption tables are commonly referenced as 
enclosures. Thus, logic had to be created to look past these occurrences. Additionally, 
user interaction must occur at the end of the program to verify that the program correctly 
interpreted and extracted all the possible data elements. 

Since the current state of consumption documents may require extensive logic test 
creation, the goal of the walkthrough analysis program was to circumvent this 
requirement. By allowing the user to walk through the input document on a line-by-line or 
page-by-page basis, the responsibility of verifying consumption data is placed on the 
end-user instead of the program. While both approaches allow the end-user to walk through 
the input document, the page-by-page program is preferred since it allows for rapid 
movement, selection, and verification. 

Regardless of which program is used, the need to refine and format the input 
document remained a central focus throughout the process. First, the input document 
should be converted to an appropriate input format (.docx, .txt, .PDF) for the extraction 
program. In this thesis, a text file with a .txt extension was used to keep the input very 
simple and to allow the native Python libraries to open the files. Once the document has 
been placed into the appropriate format, it should be reviewed and reformatted (spelling, 
table creation, etc.) in order to make the document easier to read for the end-user of the 
program. While many of these steps may be required due to the current state of 
consumption documents or the system as a whole, future standards and policies can be 
dictated and followed to reduce the steps required in this process. Although the program 
focuses on output to a file, it can be altered slightly to output a (key, value) pair. The 
decision as to what output to use is based on the storage approach used by the end-user. 
This part of the problem was left for future analysis and research. 

B. RECOMMENDATIONS 

This thesis recommends the following: 

• Establish a baseline listing and collection of consumption documents in 
one central location. In order to create a reliable automated application, 
the program should be aware of all known iterations of a consumption 
document. 

• Convert the input documents using dedicated OCR software, or outsource 
the conversion, to achieve the desired input format. Although 
professional-grade software such as OmniPage® is not free, it offers the 
highest accuracy and most functionality at a relatively low cost. 

• Refine and utilize a page-by-page walkthrough analysis program to extract 
and upload the first iteration of consumption data elements, using the 
baseline listing as a checklist. 

• Refine standard policies and correspondence procedures for the 
representation of consumption data in future documents. 

• Once the baseline has been established and refined policies have been 
created, refine and utilize an automated analysis program. 

Due to the current format of consumption documents and the lack of a 
centrally located baseline listing, implementation of an automated program would be 
ineffective. Thus, the baseline should be created using a walkthrough program and then 
gradually migrated to a point where an automated program can produce accurate results 
without unnecessary coding. 

C. FUTURE RESEARCH 

The following areas represent opportunities for future research and development: 

• Although commercial-off-the-shelf OCR technology was compared, a 
native OCR application could be researched and created. Conducting 
research in this field would require extensive background in computer 
vision, text analysis, and application coding and may require more than 
one thesis to fully develop the application. 

• While a “bare-bones” walkthrough analysis program was given, the 
creation and refinement of a walkthrough application might be 
accomplished in the scope of a single thesis. 

• Further advancement and creation of an automated program would be best 
suited for a single thesis. Should this avenue be pursued, it is 
recommended that the researcher have a background understanding and 
access to Marine Corps logistics documents and be proficient in 
application coding. 

• While the programs that are presented make use of safe coding practices, 
they do not focus on security vulnerabilities that may or may not be 
present. Thus, vulnerability testing could be conducted. 

• This thesis addresses how inputs can be placed into a database. It does not 
illustrate how this data can best be presented to the end user. Thus, data 
access and representation can be researched from a Human-Computer 
Interaction (HCI) standpoint. 

D. SUMMARY 

In conclusion, this thesis has strived to provide the best picture for the way 
forward by conducting background research into OCR and presenting multiple 
approaches for tackling the analysis phase. The purpose of the examples given, problems 
identified, and recommendations presented is to support the Marine Corps’ advance 
towards automation of logistics consumption data and associated planning, allowing its 
focus to remain on winning the nation’s wars. 